APPARATUS AND METHOD FOR STRUCTURE-BASED PREDICTION OF
AMINO ACID SEQUENCES

FIELD OF THE INVENTION 
The present invention relates to the field of genomics as well as to the field of
protein engineering, in particular, an apparatus and method to determine the amino acid
sequence space accessible to a given protein structure and to use this information (i) to
align an amino acid sequence with a protein structure in order to identify a potential
structural relationship between said sequence and structure, which is known in the art as
fold recognition and (ii) to generate new amino acid sequences which are compatible
with a protein structure, which is known in the art as protein design.

BACKGROUND OF THE INVENTION

Within the field of bioinformatics there is an emerging need for so-called data
mining tools. These tools aim at unraveling relationships that connect more elementary
data components such as oligonucleotide and amino acid sequences, single-nucleotide
polymorphisms (hereinafter referred to as SNP), expression profiles, biochemical
properties, structural and functional characterization, and so on. These relationships
represent knowledge that can be used to build so-called knowledge databases that offer
a structured description (e.g. via annotation mechanisms) of the more elementary data
components such as above defined. Knowledge databases will increasingly play a key
role in the biotechnology industry since they form an inference basis from which the
behavior of biological systems can be understood, predicted and possibly controlled via
proper biotechnological engineering.

The need for data mining tools is further increased by the availability of rich,
well annotated sequence databases which offer an in silico basis for new gene hunting
approaches. Here the determination of sequence-to-structure relationships is playing a
crucial role, also for those who are essentially not interested in structural information.
Indeed, via sequence-to-structure relationships, often distant similarities with other
proteins can be found. Such similarities are unlikely to be found by direct sequence
comparisons if the sequence identity drops below the 25% level, according to
R.F. Doolittle in Urfs and Orfs. A primer on how to analyze derived amino acid
sequences (1986) at University Science Books, Mill Valley (California) and Sander et
al. in Proteins (1991) 9:56-68.

Conversely, structure-to-sequence relationships are also rapidly
gaining attention for several reasons. Firstly, the available experimental basis has
recently evolved towards a mature stage, since the number of known protein structures
has now reached over 12,000. Depending on the classification method, these structures
represent about 600 to 700 unique folds, i.e. about 60 to 70% of the estimated total
number of about 1000 possible folds adopted by naturally-occurring proteins, according
to C. Chothia in Nature (1992) 357:543-544. Secondly, the available technological basis
has also made a giant leap since new methods for side-chain positioning (according to
M. Vasquez in Curr. Opin. Struct. Biol. (1996) 6:217-221), loop modeling (according to
Sudarsanam et al. in Protein Sci. (1995) 4:1412-1420), inverse folding (according to
Bowie et al. in Science (1991) 253:164-170), protein design (according to Dahiyat et al.
in Science (1997) 278:82-87) and so on continuously appear in the literature, which
methods also benefit from computer devices becoming more and more powerful. Thirdly,
the scientific community has started recognizing the potential of structure-based protein
engineering as can be inferred from the growing number of companies in this field and
the success of initiatives like the Critical Assessment of Techniques for Protein
Structure Prediction (CASP) contests (http://PredictionCenter.llnl.gov/) and the
structural genomics initiative (http://www.nigms.nih.gov/funding/psi.html).

Tools which are able to link a given amino acid sequence to the correct
three-dimensional (hereinafter referred to as 3D) fold or compute the amino acid sequences that
are compatible with a given protein structure are very valuable. Given the myriad of
world-wide known sequences and structures, distributed over different databases, some
crucial parameters for determining the success of any such method are prediction
accuracy and computational speed.

The present invention relates to a novel approach to link one or more amino acid
sequences with one or more protein 3D structures and vice versa. More particularly, the
invention concerns a new method preserving the accuracy of the most accurate
atom-based modeling techniques currently known while avoiding the bottleneck problem
known as the "combinatorial substitution problem", thereby gaining several orders of
magnitude in computational speed.

The fundamental problem to be addressed by the present invention relates to
multiple combined substitutions in a protein and can be illustrated by the following
example. We shall consider a protein X for which the 3D structure has been
experimentally determined. This structure is called the wild-type (WT) structure and
the amino acid sequence of protein X is called the WT sequence. We shall further
assume that X is an enzyme used in an industrial process and that there is interest in
constructing a modified protein X* showing a considerably higher thermal stability than
the wild type X, while preserving the latter's functionality. While such problems have
often been addressed by introducing point mutations (i.e. the substitution of single
amino acid residues at selected positions), it has been shown that the simultaneous
mutation of multiple amino acid residues is usually more efficient, according to
International Application Publication No. WO 98/47089. On the other hand the latter
approach, known as protein design, is extremely complicated in view of the extremely
high number of a priori possible combinations of single residue mutations. For example,
if the aforementioned task of thermally stabilizing protein X was limited to substitutions
in the core of a protein (i.e. the internal part of a protein where strictly buried residues
are located) containing e.g. 15 residues, and if only 10 (preferably apolar) residue types
were considered, there would still be 10^15 a priori possible solutions. Each such solution
represents an amino acid sequence which is different from that of protein X and is, in
principle, a candidate sequence for protein X*. In order to predict which sequences are
more stable than the WT sequence, one could apply a fast molecular modeling program,
predict the 3D-structure of each sequence, compute the potential or free energy of each
modeled structure and store the best results. However, clearly no modeling program is
at present fast enough to complete this task in a reasonable period of time.
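
The size of this search space can be made concrete with a short calculation. The following is a minimal sketch only; the modeling rate of 1000 structures per second is a hypothetical figure chosen for illustration and is not taken from the present description.

```python
def search_space_size(n_positions: int, n_types: int) -> int:
    """Number of a priori possible sequences: n_types ** n_positions."""
    return n_types ** n_positions


def exhaustive_time_years(n_sequences: int, models_per_second: float) -> float:
    """Years needed to model every sequence once at the given (assumed) rate."""
    seconds_per_year = 365.25 * 24 * 3600
    return n_sequences / models_per_second / seconds_per_year


if __name__ == "__main__":
    # 15 core positions, 10 residue types, as in the example above.
    n = search_space_size(15, 10)
    # 1000 models per second is an assumed, optimistic rate.
    years = exhaustive_time_years(n, 1000.0)
    print(f"{n:.1e} sequences, about {years:,.0f} years of modeling")
```

Even at this assumed rate the exhaustive approach would take on the order of tens of thousands of years, which illustrates why each candidate sequence cannot simply be modeled in turn.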

The former example illustrates the main bottleneck problem in protein design,
which is known in the art as the "combinatorial substitution problem" and which arises
from the fact that different individual substitutions in a protein are, in principle, not
independent as shown in FIG. 1a. The energy of a protein comprising a double
substitution may not be simply calculated from the structures of two single mutants,
because both mutated amino acid residues are "unaware" of each other in the single
mutants, i.e. the single mutant structures lack any information about the interaction
between both substituted amino acid residues in the double mutant. Consequently the
simultaneous introduction of two or more substitutions into the protein to be modeled
may lead to either favorable or unfavorable interactions which cannot be derived from
the structure of single mutants. When performing such multiple substitutions, some
information about the pair-wise interactions between the removed WT residues and the [...]
demonstrated, namely by Gordon et al. in Fold. Des. (1999) 7:1089-1098, and
Wernisch et al. in J. Mol. Biol. (2000) 301:713-736 that the combinatorial problem for
multiple substitutions can be alleviated, by means of DEE, by eliminating a number of
potential substitutions which cannot be part of the globally best solution, i.e. the most
stable protein molecule. Since eliminations occur at the single or pair residue level, they
result in a huge reduction of the search space since all combinations therewith are
eliminated as well. Nevertheless, the DEE method is a valuable method to reduce the
complexity of a system. The main disadvantages of the DEE method are (1)
non-convergence towards a unique structure, as above-mentioned, (2) the inability to predict
"second best" solutions and (3) a relatively weak performance for large systems (i.e.
having more than about 30 residues), due to a mathematically stringent process which
complicates the use of approximations.
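
The elimination step mentioned above can be sketched as follows. This is an illustrative sketch of the simplest single-rotamer dead-end elimination criterion (a rotamer is discarded when its best-case energy is still worse than the worst-case energy of an alternative rotamer at the same residue); it is not the specific procedure of the cited references. The toy energies, the data layout and all function names are hypothetical.

```python
from itertools import product

# Hypothetical toy system: 3 residues with 2, 2 and 3 rotamers (arbitrary units).
SELF_E = [[0.0, 5.0], [0.0, 1.0], [0.0, 0.5, 4.0]]
PAIR_E = {
    (0, 1): [[0.0, 0.5], [0.2, -0.3]],
    (0, 2): [[0.1, 0.0, -0.2], [0.3, 0.0, 0.1]],
    (1, 2): [[0.0, -1.0, 0.2], [0.4, 0.0, -0.1]],
}


def pair(i, r, j, s, pair_e=PAIR_E):
    """Pair energy between rotamer r of residue i and rotamer s of residue j."""
    return pair_e[(i, j)][r][s] if i < j else pair_e[(j, i)][s][r]


def total_energy(choice, self_e=SELF_E, pair_e=PAIR_E):
    """Global energy of one assignment (one rotamer index per residue)."""
    n = len(choice)
    energy = sum(self_e[i][choice[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy += pair(i, choice[i], j, choice[j], pair_e)
    return energy


def dee_singles(self_e=SELF_E, pair_e=PAIR_E):
    """Return the set of (residue, rotamer) pairs that cannot be in the GMEC."""
    n = len(self_e)
    dead = set()
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(self_e[i])):
            # Best case for rotamer r: most favorable partner at every other residue.
            best_r = self_e[i][r] + sum(
                min(pair(i, r, j, s, pair_e) for s in range(len(self_e[j])))
                for j in others
            )
            for t in range(len(self_e[i])):
                if t == r:
                    continue
                # Worst case for the alternative rotamer t.
                worst_t = self_e[i][t] + sum(
                    max(pair(i, t, j, s, pair_e) for s in range(len(self_e[j])))
                    for j in others
                )
                if best_r > worst_t:
                    dead.add((i, r))
                    break
    return dead


def brute_force_gmec(self_e=SELF_E, pair_e=PAIR_E):
    """Exhaustive search, feasible only for toy systems; used here as a check."""
    ranges = [range(len(rotamers)) for rotamers in self_e]
    return min(product(*ranges), key=lambda c: total_energy(c, self_e, pair_e))
```

In this toy system no eliminated rotamer belongs to the lowest-energy assignment found by exhaustive search, which is the property that makes the elimination safe.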

One problem to be addressed by the present invention is to develop a method
for rapidly and accurately predicting the compatibility of one or more amino acid sequences
with one or more protein 3D-structures. The requirement of accuracy strongly pleads for
the use of a detailed, atom-based force field (energy function) to compute the local and
other interactions of a set of amino acid residues placed onto a 3D-structure.
Furthermore, the above example showed that a correct positioning, using said
interaction energies, is severely hampered by the combinatorial problem. On the other
hand, all current methods which fully account for combinatorial effects are in conflict
with the second requirement of a high computational speed. This constitutes the major
dilemma in protein modeling, more specifically in fold recognition and protein design.

In order to determine sequences that are reachable by a given protein structure 
and/or to identify the best-matching protein 3D-structure, given a particular sequence, 
one may use (1) prediction algorithms relying on statistically derived substitution
matrices, (2) prediction algorithms based on statistical descriptors, and (3) computation 
of explicit conformational energies, corrected for solvation effects. 

In the approach using descriptors, a set of reference proteins (often referred to as 
the learning set) is used to derive properties that can statistically be used to judge the 
propensity for any given amino acid type to be present in some given structural 
environment. For example, in the original derivation of the 3D-profiles of Bowie et al. 
in Science (1991) 253:164-9, the environment of each residue is featured by (a) a buried 
side chain area, (b) a local secondary structure and (c) the polarity of the environment. 
Using these features, the authors recognized 18 environmental classes. As a
result, any given protein structure can be transformed into a linear string composed of
the class identifiers for each position along the protein's sequence. From a database of
proteins of known structure, a so-called 3D-to-1D scoring table is built to assign a score
for finding each of the 20 naturally-occurring amino acid types associated with each of
the environmental classes. Using this scoring table, the best sequence alignment can
then be computed against the string of environment classes for a number of proteins.
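
The alignment of a sequence against a string of environment classes can be sketched as a standard global alignment by dynamic programming. The sketch below is a simplified illustration, not the procedure of Bowie et al.: it uses two hypothetical classes ('b' for buried, 'e' for exposed) instead of 18, one-letter residue codes, an invented scoring table and a linear gap penalty.

```python
# Hypothetical 3D-to-1D scoring table: (environment class, residue) -> score.
TABLE = {
    ("b", "L"): 1.0,   # leucine in a buried environment
    ("b", "K"): -1.0,  # lysine in a buried environment
    ("e", "K"): 1.0,   # lysine in an exposed environment
    ("e", "L"): -0.5,  # leucine in an exposed environment
}


def align_score(sequence, env_string, table=TABLE, gap=-1.0, default=0.0):
    """Best global alignment score of a sequence against an environment string."""
    n, m = len(sequence), len(env_string)
    # dp[i][j] = best score aligning sequence[:i] with env_string[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + table.get(
                (env_string[j - 1], sequence[i - 1]), default
            )
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


if __name__ == "__main__":
    print(align_score("LKL", "beb"))  # every residue matches its environment
    print(align_score("LL", "beb"))   # one gap is needed to skip the exposed position
```

Scanning a query sequence against many proteins then amounts to repeating this computation for each environment string and ranking the resulting scores.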

This procedure initiated the so-called threading approach in inverse folding 
where the compatibility of candidate sequences with a given scaffold is examined. As 
can be anticipated from the discretization procedure, wherein the complexity of a 3D
environment is mapped onto a limited number of classes, the success of the threading
approach is significantly limited according to M.J. Sippl in J. Comp.-Aided Mol. Design
(1993) 7:473-501 and Chiu et al. in Protein Eng. (1998) 11:749-752.

Global sequence signatures involving many residues may be detected but, in 
view of its statistical nature, the threading method will not be applicable to accurately
assess the sequence-structure compatibility implying e.g. perturbation effects wherein a 
given WT sequence is varied at a number of positions. In addition, even if a sequence is 
correctly predicted to be compatible with a given structure, the threading method itself 
is incapable of deriving an accurate 3D-structure of the sequence laid over a given
protein backbone structure.

In order to achieve accurate predictions, it is desirable to take into account the
full complexity of the 3D environment of each amino acid residue. In the third approach
wherein predictions are based on computational methods assessing the energetics of a
system of protein residues, it is possible to compute the conformation of lowest energy
and even the sequence of lowest energy for a system consisting of a protein backbone
and a set of amino acid residues which are allowed to vary in conformation and/or
amino acid type. Most often, such computations are based on advanced search
techniques combined with a detailed model for atomic interactions. In addition to the
bonded and non-bonded interaction terms as used in molecular mechanics force fields
(e.g. the CHARMM force field of Brooks et al. in J. Comput. Chem. (1983) 4:187-217),
these energy computations may also assess the hydrophobic effect through a suitable
solvation potential such as e.g. the pair-wise solvation potential disclosed by Street et al.
in Fold. Des. (1998) 3:253-258, thereby correcting for the modeling in absence of
explicit water molecules.

In order to determine whether a sequence is compatible with a given protein
3D-structure or scaffold, one could in principle explore different alignments of the query
sequence relative to the given protein backbone and search for alignments yielding low 
conformational energy, if any. However such a process suffers from several 
inconveniences. First, this process is very time consuming, especially if the given query 
sequence has to be scanned against a large database of protein structures, thereby 
searching for the best scaffold that is likely to match with the given query sequence. 
Indeed, for each protein a series of detailed computations need to be carried out at the 
atomic level: several alignments need to be explored and each time the conformation or 
sequence of lowest energy must be identified, which involves the solution of the so- 
called combinatorial problem whereby the best solution must be found out of all 
possible combinations of single-residue states. Consequently it is unlikely that such 
computations can be carried out in a high-throughput mode. Second, the alignments 
must allow for gaps which are often located in the loop regions. However, for
evolutionarily distant proteins the precise gap locations and lengths are a priori unclear,
which in turn necessitates multiple trial runs to match one sequence with one structure. 

In addition, some important problems simply cannot be answered by this third 
approach, such as for instance the task of computing all possible amino acid sequences 
that are compatible with a given protein scaffold (below a pre-set threshold, using a
suitable scoring function), because it is at present impossible to generate all possible 
amino acid sequences and assess their compatibility within a reasonable computation 
time. 

In summary, there is a stringent need for a fundamentally new method in
order to explore the sequence space that is compatible with a given scaffold and to
rapidly scan a given query sequence for structural compatibility against a database of
known protein structures.

GLOSSARY 

Throughout the foregoing and following description of the invention, the
abbreviations, acronyms and technical terms used shall have the following meanings
and definitions:

ALA, ARG, ASN, ASP, CYS, GLN, GLU, GLY, HIS, ILE, LEU, LYS, MET, PHE,
PRO, SER, THR, TRP, TYR, VAL: standard 3-letter code for the naturally-occurring
amino acids (or residues) named, respectively: alanine, arginine, asparagine,
aspartic acid, cysteine, glutamine, glutamic acid, glycine, histidine, isoleucine, leucine,
lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine,
valine.

ASA: Accessible Surface Area
DEE: Dead-End Elimination
GMEC: Global Minimum Energy Conformation
PDB: Protein Data Bank
RMSD: Root-Mean-Square Deviation

Accessible Surface Area (ASA)

The surface area of (part of) a molecular structure, traced out by the mid-point of
a probe sphere of radius ~1.4 Å rolling over the surface.

Amino acid, amino acid residue

Group of covalently bonded atoms for which the chemical formula can be written
as +H3N-CHR-COO-, where R is the side group of atoms characterizing the
amino acid type. The central carbon atom in the main-chain portion of an amino
acid is called the alpha-carbon (Cα). The first carbon atom of a side chain,
covalently attached to the alpha-carbon, if any, is called the beta-carbon (Cβ). An
amino acid residue is an amino acid comprised in a chain composed of multiple
residues and which can be written as -HN-CHR-CO-.

Backbone

See: Main chain

Conformational space

The multidimensional space formed by considering (the combination of) all
possible conformational degrees of freedom of a protein, where one degree of
freedom usually refers to the torsional rotation around a single covalent bond. In a
broader sense, it also includes possible sequence variation by extending the
definition of a rotamer to a particular amino acid type in a discrete conformation.

Dead-End Elimination (DEE)

A method to reduce the conformational space of a side-chain modeling task by
eliminating individual side-chain rotamers which are incompatible with the
system on the basis of one or more rigorous mathematical criteria. By a variant of
this method, pairs of rotamers belonging to different residues can also be
eliminated.

Energy function, force field

The function, depending on the atomic positions and chemical bonds of a
molecule, which allows one to obtain a quantitative estimate of the latter's potential or
free energy.

Flexible side chain, rotatable side chain, modeled side chain 

A side chain modeled during performance of the method. 

Global energy 

The total energy of a global structure (e.g. the GMEC) which may be calculated in
accordance with the applied energy function.

Global minimum 

The globally optimal solution of a protein consisting of a number of elements (e.g. 
residue positions) which are a priori susceptible to variation (e.g. using the 
concept of rotamers). 

Global Minimum Energy Conformation, GMEC

The global conformation of a protein which has the lowest possible global energy 
according to the applied energy function, rotamer library and template structure. 

Global structure, global conformation 

The molecular structure of a uniquely defined protein, e.g. represented as a set of
cartesian co-ordinates for all atoms of the protein.

Homology modeling 

The prediction of the molecular structure of a protein on the basis of a template 
structure. 

Main chain, backbone

The part of a protein consisting of one or more near-linear chains of covalently
bonded atoms, which can be written as (-NH-CH(-)-CO-)n. Herein, the Cβ atom
is considered as a pseudo-backbone atom and is included in the template because
its atomic position is unambiguously defined by the positions of the true backbone
atoms.

Protein Data Bank (PDB)

The single international repository for the processing and distribution of 3D
macromolecular structure data primarily determined experimentally by X-ray
crystallography and nuclear magnetic resonance.

Protein design

That part of molecular modelling which aims at creating protein molecules with
improved properties like stability, immunogenicity, solubility and so on.

Root-Mean-Square Deviation, RMSD

The average quadratic deviation between the atomic positions, stored in two sets
of co-ordinates, for a collection of atoms.

Rotamer, rotameric state

An unambiguous discrete structural description of an amino acid side chain,
whether preferred or not. According to this definition, which is extended over
traditional definitions by considering less preferred states, it is not required that
rotamers be interchangeable via rotation only.

Rotamer library

A table or file comprising the necessary and sufficient data to define a collection
of rotamers for different amino acid types.

Rotamer/template (interaction) energy

The sum of all atomic interactions, as calculated using the applied energy
function, within a given side-chain rotamer at a specific position in a protein
structure and between the rotamer and the template.



Sequence space

The multidimensional space formed by considering the combination of all (or a 
set of) possible amino acid substitutions at all (or a set of) residue positions in a 
protein. 

Side chain, side group

Group of covalently bonded atoms, attached to the main chain portion of an amino 
acid (residue). 

Side-chain/side-chain pair(wise) (interaction) energy 

The sum of all atomic interactions, as calculated using the applied energy
function, between two side-chain rotamers at two specific positions in a protein
structure.

Steepest descent energy minimization 

Simple energy minimization method, belonging to the class of gradient methods,
only requiring the evaluation, performed in accordance with a particular energy
function, of the potential energy and the forces (gradient) of a molecular structure.

Template 

All atoms in a modelled protein structure, except those of the rotatable side 
chains. 

SUMMARY OF THE INVENTION

The present invention provides a totally new approach to handle
sequence-to-structure and structure-to-sequence relationships:

(i) in a first aspect, the present invention comprises the systematic analysis of
known protein structures in terms of individual contributions of single, pairs and
occasionally also multiplets of amino acid residues to the global energy of a
protein comprising any of these residues, and

(ii) in a second aspect, the present invention comprises a method of combining the
single, pair and multiplet contributions in order to estimate the global energy of
a protein structure comprising a larger number of substitutions.

A main feature of the present invention is that energetic values for single, pairs
and multiplets of residues in different conformations can be pre-computed once and
stored on an appropriate data storage device for all, or a subset of all, known
protein 3D-structures. These data can afterwards be retrieved from the storage device
when needed for the calculation of global energies. The same data elements can be used
over and again in various experiments wherein the compatibility of different sequences
with the same 3D-structure is investigated. The present invention allows one to (i) produce
data in a fast and consistent way and (ii) use the produced data to calculate accurate
global energies of proteins which have undergone a relatively large number of
mutations compared to the original structure.

A central idea of the present invention is to analyze known protein structures by
computing a quantitative measure reflecting the energetic compatibility of all naturally
occurring and synthetic amino acids of interest at each residue position of interest in the
structure. In a similar way, the energetic compatibility of selected residue pairs and
multiplets may be computed as well. Upon generating such compatibility data, it is
possible to take into account conformational adaptation effects for residues surrounding
the considered substituted residue(s), as if the same substitutions were introduced in
vitro by protein engineering techniques.

The said energetic compatibility data may be generated for
different protein structures and stored concisely using appropriate data storage devices.
Ideally, it should be sufficient to use only pre-computed data for single residues and
residue pairs in order to derive the compatibility of any sequence with a given
structure with sufficient accuracy. In such case, it would become possible to derive
extended lists of structure-compatible sequences as well as sequence-compatible
structures arithmetically, instead of by means of sophisticated protein modeling.
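
A purely arithmetic estimate of this kind can be sketched as a table lookup. The combination rule below (sum of single-residue values, plus a correction for each stored pair equal to the pair value minus its two single values) is an assumption made for illustration and is not asserted to be the rule of the present invention. The table contents are invented, and the positions 5, 30 and 52 merely echo the residues mentioned for Figure 5.

```python
from itertools import combinations

# Hypothetical pre-computed data: (position, residue type) -> energy.
SINGLE = {
    (5, "LEU"): -1.0,
    (30, "PHE"): 0.5,
    (52, "VAL"): -0.25,
}
# Hypothetical pre-computed data for pairs, keyed in order of increasing position.
PAIR = {
    ((5, "LEU"), (30, "PHE")): 1.5,
}


def estimate_substitution_energy(substitutions, single=SINGLE, pair=PAIR):
    """Estimate the energy of a multiple substitution from stored values.

    substitutions: dict mapping position -> residue type.
    Pairs without a stored value are assumed to be additive (no correction).
    """
    items = sorted(substitutions.items())
    total = sum(single[item] for item in items)
    for a, b in combinations(items, 2):
        if (a, b) in pair:
            # Assumed coupling term: pair value minus both single values.
            total += pair[(a, b)] - single[a] - single[b]
    return total


if __name__ == "__main__":
    triple = {5: "LEU", 30: "PHE", 52: "VAL"}
    print(estimate_substitution_energy(triple))
```

Because only dictionary lookups and additions are involved, many candidate sequences can be scored against the same structure without any further modeling.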

The present invention thus first provides a method for analyzing a protein structure,
the method being executable in a computer under the control of a program stored in the
computer, and comprising the following steps:

A. receiving a reference structure for a protein, whereby

- the said reference structure forms a representation of a 3D structure of the
said protein, and

- the said protein consists of a plurality of residue positions, each carrying a particular
reference amino acid type in a specific reference conformation, and

- the protein residues are classified into a set of modeled residue positions and a set
of fixed residues, the latter being included into a fixed template;

B. substituting into the reference structure of step (A) a pattern, whereby

- the said pattern consists of one or more of the modeled residue positions defined 
in step (A), each carrying a particular amino acid residue type in a fixed 
conformation, and 

- the one or more amino acid residue types of the said pattern are replacing the 
corresponding amino acid residue types present in the reference structure; 

C. optimizing the global conformation of the reference structure of step (A) being 
substituted by the pattern of step (B), whereby 

- a suitable protein structure optimization method, based on a function allowing one to
assess the quality of a global protein structure or any part thereof, is used in
combination with a suitable conformational search method, and

- the said structure optimization method is applied to all modeled residue positions 
defined in step (A) not being located at any of the pattern residue positions 
defined in step (B), and 

- the pattern and template residues are kept fixed;

D. assessing the energetic compatibility of the pattern defined in step (B) within the
context of the reference structure defined in step (A) being structurally optimized
in step (C) with respect to the said pattern, by way of comparing the global
energy of the substituted and optimized protein structure with the global energy
of the non-substituted reference structure; and

E. storing a value reflecting the said energetic compatibility of the pattern together 
with information related to the structure of the pattern in the form of an energetic 
compatibility object. 
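
Steps (A) to (E) can be sketched on a toy system as follows. This is a minimal illustration under stated assumptions: the energy model (a table of rotamer/template energies plus a table of pair energies), the exhaustive search standing in for "a suitable conformational search method", the sign convention (substituted minus reference) and all names and values are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ECO:
    """Energetic compatibility object: a pattern and its stored energy value."""
    pattern: tuple   # ((position, residue type, rotamer index), ...)
    energy: float    # assumed convention: E(substituted, adapted) - E(reference)


# Step A (toy input): modeled positions with reference type and conformation.
REFERENCE = {1: ("ALA", 0), 3: ("LEU", 0)}
# Rotamer indices available to each modeled position for its reference type.
ROTAMERS = {1: [0], 3: [0, 1]}
# Hypothetical rotamer/template energies: (position, type, rotamer) -> energy.
SELF_E = {(1, "ALA", 0): 0.0, (1, "PHE", 0): -1.0,
          (3, "LEU", 0): 0.0, (3, "LEU", 1): 0.5}
# Hypothetical pair energies; missing entries are taken as zero.
PAIR_E = {(1, "PHE", 0, 3, "LEU", 0): 4.0}  # steric clash before adaptation


def global_energy(state, self_e, pair_e):
    """Sum of rotamer/template and pair energies for one global structure."""
    positions = sorted(state)
    energy = sum(self_e[(p, *state[p])] for p in positions)
    for k, p in enumerate(positions):
        for q in positions[k + 1:]:
            energy += pair_e.get((p, *state[p], q, *state[q]), 0.0)
    return energy


def compute_eco(reference, pattern, rotamers, self_e, pair_e):
    """Steps B to E for one pattern (dict position -> (type, rotamer))."""
    e_ref = global_energy(reference, self_e, pair_e)
    # Step C: only modeled positions outside the pattern may adapt.
    free = [p for p in sorted(reference) if p not in pattern]
    best = float("inf")
    for combo in product(*(rotamers[p] for p in free)):
        state = dict(pattern)  # step B: pattern substituted and kept fixed
        for p, rot in zip(free, combo):
            state[p] = (reference[p][0], rot)
        best = min(best, global_energy(state, self_e, pair_e))
    # Steps D and E: compare with the reference and package the result.
    key = tuple(sorted((p, t, r) for p, (t, r) in pattern.items()))
    return ECO(pattern=key, energy=best - e_ref)


if __name__ == "__main__":
    print(compute_eco(REFERENCE, {1: ("PHE", 0)}, ROTAMERS, SELF_E, PAIR_E))
```

In the toy data the substitution at position 1 clashes with the reference conformation of position 3, and the search relieves the clash by moving position 3 to its second rotamer before the energy difference is stored.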

By "fixed" in step (A), it is meant that the template structure may be varied by at 
most about 1 Angstrom in RMSD compared to the original structure.
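
This tolerance can be checked with a short RMSD computation. The sketch below assumes that both co-ordinate sets list the same atoms in the same order and are already expressed in the same frame (no superposition is performed); the function names and the default tolerance argument are illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("co-ordinate sets must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def is_fixed(original, varied, tolerance=1.0):
    """True if the varied template stays within the tolerance (in Angstrom)."""
    return rmsd(original, varied) <= tolerance


if __name__ == "__main__":
    original = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    shifted = [(0.5, 0.0, 0.0), (2.0, 0.0, 0.0)]
    print(rmsd(original, shifted), is_fixed(original, shifted))
```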

For the purpose of the rotameric embodiment of the method, such as explained
hereinafter, the selected rotamer library may be either backbone-dependent or
backbone-independent. A backbone-dependent rotamer library allows different rotamers
depending on the position of the residue in the backbone, e.g. certain leucine rotamers
are allowed if the position is within an α helix whereas other leucine rotamers are
allowed if the position is not in an α helix. A backbone-independent rotamer library uses
all rotamers of an amino acid at every position, which is usually preferred when
considering core residues, since flexibility in the core is important. However,
backbone-independent libraries are computationally more expensive and thus less preferred for
surface and boundary positions. [...] means, either on the [...]
any [...] that can be appreciated [...]
as being closely related to the optimum. For the [...]
may be either explicit or implicit, the latter being accomplished e.g. by
ordering or by conventional compression methods.

- [...] method, for instance in the form of a protein sequence which is different from any
known protein sequence;

- a nucleic acid sequence encoding a protein sequence analysed by the method as
above-defined;

- [...] vector or a host cell comprising the nucleic acid sequence as above-defined;

- [...] a therapeutically effective amount of a protein sequence generated by [...] and a
pharmaceutically acceptable carrier;

- a method of treating a disease in a mammal, comprising administering to said
mammal a pharmaceutical composition such as [...];

- a database comprising a set of ECO's in the form of a data structure, e.g. [...]
computer readable medium;

[...]

[...] described with reference to the following detailed description.

[...] their conformations of lowest energy.

Figure 2 shows a schematic representation of the mechanism of pattern
introduction (PI) and structure adaptation (SA) according to the present invention.

Figure 3 illustrates the principle of additivity of adaptation (AA) according to
the present invention.

Figure 4 shows the result of the alignment procedure of human alpha-lactalbumin
against a single residue GECO profile for hen lysozyme, according to the
present invention.

Figure 5 shows a correlation diagram between the energy obtained from the
effective modeling of 100 different triple mutations at residues 5, 30 and 52 of the B1
domain of protein G and the energy calculated for the same substitutions derived on the
basis of GECO energies, according to the present invention.

Figure 6 shows a computer system suitable for use with the present invention.

DETAILED DESCRIPTION OF THE INVENTION

As previously indicated, the present invention consists of two main aspects: first a
new method including the ECO concept, secondly the use of such a method to assess the
energy of global multiply substituted protein systems.

The first aspect of the present invention will be explained namely with reference to 
figures 1 to 3 which are now described in detail. 

In Figure 1, the bold line symbolizes the backbone, whereas thin lines show
residue side chains in their energetically optimal conformation. Residue positions are
numbered from 1 to 5. Different residue types are represented by lines of different
length. Black lines are used for the WT structure and grey lines show substituted
residues. Dotted lines show residues before substitution or conformational adaptation. In
Figure 1a, residues 1 and 5 have been substituted by larger residue types. While each
single substitution would be compatible with the remainder of the structure, the double
mutation leads to an incompatibility situation (symbolized by the dotted circle), thus
both mutations cannot be modeled independently. Conversely, Figure 1b shows a
situation wherein two mutations are assumed to be independent: the single substitutions
at residues 1 and 4, as well as the double mutation do not change their environment and
both residues do not significantly interact with each other. Figure 1c shows that two
residues which do not directly interact may still be coupled through conformational
adaptation: the larger residue type at residue position 2 necessitates (dotted circle at left)
a conformational change (indicated by an arrow) of residue 3 which is in conflict
with residue 4 (dotted circle at right).

In Figure 2, [...] the protein of interest in its reference structure (ref). The middle drawing
[...] position 1. The dotted circle illustrates a sterical conflict with residue 3. After the SA
[...] drawing), residues 3, 4 and [...] have rearranged from their reference state
(dotted lines) to a new, adapted conformation (full lines), which overcomes the steric
conflict. The energy of each of these structures is represented underneath each
drawing, showing that the introduction of p into the reference structure first leads to a
[...] "intermediate" structure which can be relaxed towards a structure of lower
energy. The difference between the energy of the reference structure, E_ref, and that of
the substituted and adapted structure, E(p), is the ECO energy, E_ECO(p), which can be
considered as a substitution energy.

In Figure 3, the same notations are used as in Figure [...]. Figure 3a shows an
example wherein the AA principle is met, i.e. the same conformational adaptations,
symbolized by the arrows, are observed for the double substitution (at right) as
[...] for both single substitutions (at left). Figure 3b shows an example wherein the [...]

[...] the ECO concept and its possible applications, the following [...]

1. [...] (see STRUCTURE DEFINITION);
2. a definition of conformational freedom (see CONFORMATIONAL FREEDOM);
3. a description of applicable energy models (see ENERGY FUNCTION);
4. a description of preferred modeling techniques which are able to generate
an ECO (see STRUCTURE OPTIMIZATION); and
5. a definition of a reference structure, its relevance and some preferred options
(see REFERENCE STRUCTURE).

Next, the ECO concept is discussed in detail, both descriptively (see
DESCRIPTION OF THE ECO CONCEPT) and mathematically (see FORMAL
DESCRIPTION OF ECO). Then, a preferred approach to systematically derive ECO's is
[...] (see SYSTEMATICAL PATTERN ANALYSIS (SPA) MODE). Next, some
[...] ECO's are elaborated (see GROUPED ECO'S). Finally, a
preferred method to estimate the energy of multiply substituted protein structures
from single and double ECO and/or GECO energies is described (see ASSESSING
GLOBAL ENERGIES FROM SINGLE AND DOUBLE (G)ECO's).

STRUCTURE DEFINITION

The present method to generate ECO's is applicable to protein molecules for which
a 3D representation (or model or structure) is available. More specifically, the atomic
coordinates of at least part of a protein main chain, or backbone, must be accessible.
Preferably a full main chain structure and, more preferably, a full protein structure is
available. In the event that the available protein structure is defective, some
conventional modeling tools (e.g. the so-called spare parts approach disclosed by
Claessens et al. in Prot. Eng. (1989) 2:335-345 and implemented e.g. by Delhaise et al.
in J. Mol. Graph. (1984) 2:103-106, or other modeling techniques such as reviewed by
Summers et al. in Methods in Enzymology (1991) 202:156-204) may be applied to
build missing atoms or loops. Depending on the force field used (see ENERGY
FUNCTION), the coordinates of at least the heavy atoms (any atom other than hydrogen),
or heavy atoms together with polar hydrogen atoms, or all atoms, of at least a region
around the amino acid residues of interest must be known, where the said region should
be sufficiently large to enable accurate energetic computations. Optionally ligands, such
as co-factors, ions, water molecules and so on, or parts thereof may be included in the
definition of the protein structure as well. A known structure for the side chains of a
protein of interest is not an absolute requirement since the method of the present
invention includes a side-chain placement step. However, knowing starting coordinates
for the said side chains may be useful when the user should decide to energy-minimize
the starting structure or to keep some side chains in a fixed conformation.

The origin of the 3D protein structure is relatively unimportant within the context 
of the present invention since the ECO concept remains applicable to any 3D-structure 
conforming to aforementioned criteria. Yet, the meaningfulness of generated ECO's 
(and, therefore, the usefulness of applications derived therefrom) is directly proportional 
to the quality (accuracy) of the starting structure. Therefore, a number of possible
sources of atomic coordinates, in decreasing order of preference, are provided 
hereinafter: X-ray determined structures, NMR-determined structures, homology 
modeled structures, structures resulting from de novo structure prediction. 

In the preferred embodiment of the present invention, some atoms - generally
[...] ECO. Preferably, a template consists of all main-chain atoms, Cβ-atoms and ligand
[...]. A useful but more delicate option is to include particular side chains into the
[...]

The definition of a protein as used herein also [...]
extension, the method may also be applied to molecules [...]
mimetic structures and peptoids.

With respect to the second aspect of the present invention [...]
the energetic fitness of multiply substituted proteins on the basis of [...]

CONFORMATIONAL FREEDOM 

The present method involves two [...] amino
acid substitutions and [...] conformational [...].
The present method to generate ECO's can [...] allowing conformational
changes either (i) in a discrete rotameric space or (ii) in a continuous [...]
rotameric approach being preferred will be described first.

The concept of rotameric states [...]
chains) can adopt only a limited number of energetically [...]

The [...] aspect of [...]
space of amino [...]
representative states on the basis of a rotamer library (for instance the rotamer library
described by De Maeyer et al. in Fold. & Des. (1997) 2:53-66, comprising 859 elements
for the 17 amino acids having rotatable side chains), thus considerably facilitating
a thorough, systematical analysis of a protein structure. Furthermore, some energetic 
properties of individual rotamers within the context of a given protein structure can be 
pre-calculated before any modeling action takes place. The same can be done for 
pairwise interaction energies between rotamers of different residues. On the basis of
individual (or "self") and mutual (or "pair") energies, a large number of rotamers and
rotamer combinations can be identified as being incompatible with the scaffold of the
protein of interest. Another major advantage of the rotamer concept is that search
methods are known in the art to find the energetically best possible global structure for
the side chains of a protein, given a fixed main chain, a rotamer library and an energy
function. Examples of such methods known in the art are the various embodiments of
the DEE method such as disclosed by Desmet et al. in Nature (1992) 356:539-542,
Lasters et al. in Protein Eng. (1993) 6:717-722, Desmet et al. in The Protein Folding
Problem and Tertiary Structure Prediction (1994) 307-337, Goldstein in Biophys. J.
(1994) 66:1335-1340 and De Maeyer et al. (cited supra). Irrespective of the search
method, the use of rotamers makes it possible to overcome local energetic barriers by
simply switching between different rotameric states.

Furthermore, it should be noted that ECO's refer to the structural compatibility of
"patterns" that are grafted into a given protein structure. As used herein, a pattern
consists, at the choice of the user, of one or more residue positions, each carrying an
amino acid type in a certain conformation. Pattern residues need not be covalently
linked to each other, but they must be intact (at least the whole side chain must be
considered) and their 3D-structure must be unambiguous. In the preferred embodiment
of the present invention, different ECO's may be generated, where each ECO is an
energetic compatibility object defined on one or more residue positions, each having
assigned a specific amino acid type and adopting a well-defined rotameric state.

The selection of rotamers occurs on the basis of a rotamer library, for instance 
according to one of the three following possibilities: (i) the retrieval of side-chain 
rotamers from a backbone independent library, (ii) the retrieval of side-chain rotamers 
from a backbone dependent library and (iii) the retrieval of combined main-chain/side-chain
rotamers from a backbone dependent library. The two former possibilities
necessarily imply that the main-chain structure of a pattern residue is taken from the
protein starting structure. The latter possibility requires an additional comparison
between the local backbone conformation in the protein structure and all backbone
conformations in the library and the retrieval of a side-chain rotamer associated with the
closest matching backbone conformation and is therefore more complicated since it
involves the insertion of the main-chain portion of the rotamer selected from the library
with the backbone of the protein structure. This can be performed in many ways. For
instance for isolated pattern residues, it is possible to apply a least-squares fitting of the
[...] by [...] dihedral [...]
rotamer onto those of the protein structure where the rotamer is to [...]. For
consecutive pattern residues (e.g. loops), the method described in Brugel (Delhaise et al.
in J. Mol. Graph. (1984) 2:103-106) can be used. In both cases, the method may be
[...] by an energy refinement step using a gradient method, e.g. a [...]
energy minimization. A common feature of the three possible selection methods is that
they result in a unique definition of the 3D structure of each pattern residue: their
side-chain conformations are selected from a rotamer library whereas their main-chain
conformations are taken either from a library or from the protein starting structure.

Another aspect of the present invention concerns the conformational freedom of
that part of the protein not belonging to the pattern of interest, i.e. the pattern
environment. In this respect, the present method includes a step wherein the pattern
[...] an energetical optimization of intramolecular interactions within the pattern
environment and between the latter and the pattern atoms. Thus, we shall make a
distinction between a rotameric and a non-rotameric embodiment, both [...]
present invention, since they have quite different characteristics. In the preferred
rotameric embodiment, the accessible conformational freedom of the pattern
environment is determined by a rotamer library. Similarly to pattern residues, the
rotamer library used for the non-pattern residues may be either a side-chain or a
combined main-chain/side-chain rotamer library. However, restricting the
conformational space of the non-pattern residues to only side-chain rotamers thus
[...]

In contrast to the assignment of one unique rotameric conformation to each of the
pattern residues, the residues of the pattern environment have [...]
rotameric states known in the library for the associated residue types. The total
conformational space is then defined as all possible combinations of rotamers for the
non-pattern residues. The search method described in the section STRUCTURE
OPTIMIZATION below then needs to find the energetically most favorable
combination.

As will be appreciated by those skilled in the art, several techniques can be applied
to restrict the number of rotamers for individual residues in a protein structure, based on
criteria related to local secondary structure, local energy, solvent accessibility and so on.
In addition, some residues of the pattern environment may be completely excluded from
adaptation (this is equivalent to including such residues into the template, see
STRUCTURE DEFINITION) in order to speed up computations. This however may
lead to an underestimate of the structural compatibility of the pattern of interest. The user
therefore has to decide whether the induced level of error remains acceptable in the
application concerned. For example in a fold recognition application, where the
compatibility of a long amino acid sequence with different protein folds is to be
assessed, many single residues may be erroneously predicted as incompatible as long as the
global score for a correct or "true" sequence-structure correlation remains
distinguishable from the incorrect or "false" matches. It may even be preferred to
generate ECO's while limiting adaptation to only those residues in immediate contact
with the pattern residue(s). To the contrary, should a user be interested to test whether a
given pattern (e.g. the binding site of a receptor, or any part thereof) can be designed
into a scaffold of interest, then it is advisable to eliminate as many sources of error as
possible and therefore to retain a high number of adaptive residues around the said
pattern. As a general rule, it is suggested to apply one of many possible threshold-based
methods directly or indirectly related to the distance between pattern and environment
residues. For example a convenient criterion may be to include in the set of adaptive
residues all residues j for which

$$\min_i D(i,j) < S + n(i^a) + n(j^b) \qquad \text{(eq. 1)}$$

where $i$ is a pattern residue, $j$ is an environment residue, $D(i,j)$ is preferably equal to the
distance between the Cβ-atoms of $i$ and $j$ in the protein structure, $\min_i D(i,j)$ is the
closest such distance for all pattern residues $i$, $S$ is a predefined threshold value
(preferably about 6 Å but possibly lower or higher values for less or more critical
applications, respectively), $a$ and $b$ are the residue types at positions $i$ and $j$,
respectively, and $n(i^a)$ and $n(j^b)$ are measures of the size of the side-chains at the
respective positions $i$ and $j$, where $n$ is preferably equal to the number of heavy
atoms (i.e. all non-hydrogen atoms) counted from the Cβ-atom (included) along the
longest branch in the side chain. For example, if [...]
included in the set of adaptive residues if $D(i,j) < 6 + 3 + 5 = 14$ Å.
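
By way of illustration only, the distance criterion of equation 1 may be sketched in Python as follows. The function name `adaptive_residues`, the dictionary layout and the small table of side-chain sizes are assumptions made for this sketch and do not form part of the disclosure. The sizes are counted from the Cβ-atom (included) along the longest branch, and an environment residue is treated as adaptive as soon as the inequality holds for at least one pattern residue, which is one possible reading of the min operator.

```python
from math import dist

# Side-chain size n: heavy atoms counted from C-beta (included) along the
# longest branch. Only a few residue types are listed, for illustration.
SIDE_CHAIN_SIZE = {"ALA": 1, "SER": 2, "VAL": 2, "LEU": 3, "LYS": 5, "ARG": 6}


def adaptive_residues(pattern, environment, threshold=6.0):
    """Return ids of environment residues j that satisfy the distance criterion.

    pattern, environment: dict of residue id -> (residue type, C-beta xyz tuple).
    threshold: the value S in angstrom (about 6 in the text).
    """
    selected = []
    for j, (type_j, cb_j) in environment.items():
        for type_i, cb_i in pattern.values():
            cutoff = threshold + SIDE_CHAIN_SIZE[type_i] + SIDE_CHAIN_SIZE[type_j]
            if dist(cb_i, cb_j) < cutoff:
                selected.append(j)
                break  # one pattern residue close enough is sufficient
    return selected


# A size-3 pattern residue and a size-5 neighbour give a cutoff of 6 + 3 + 5 = 14.
pattern = {10: ("LEU", (0.0, 0.0, 0.0))}
environment = {11: ("LYS", (13.9, 0.0, 0.0)), 12: ("LYS", (14.1, 0.0, 0.0))}
print(adaptive_residues(pattern, environment))
```

In this hypothetical case only residue 11 falls inside the 14 Å cutoff; residue 12 would be kept fixed.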

Although less preferred, the present method to generate ECO's can also be
performed without using the rotamer concept. Again a distinction needs to be made
between the pattern residues and the pattern environment. With respect to the
conformation of the one or more pattern residues, a first approach is to "copy" a sub-structure
from a known 3D-structure and to graft it into the structure of the protein of
interest using a suitable structure fitting method. This can be of interest, for example, in
loop grafting or when attempting to substitute binding site patterns. This [...] matches
the ECO concept since the definition of a pattern requires establishing a well-defined
set of residue positions (e.g. a loop fragment), carrying a well-defined amino acid
sequence (one residue type at each position) in a well-defined conformation (e.g. a
known conformation). An alternative way to assign a conformation to pattern residues is to
randomly sample it from a continuous torsion space around main-chain and/or side-chain
dihedral angles using a random number generator. This approach does form a
feasible alternative to investigate the compatibility of different patterns occupying the
same residue positions and types while adopting different non-rotameric conformations.

The non-pattern residues form the subject of the second step of the present method,
i.e. the adaptation step. Here also, a less preferred but feasible non-rotameric approach
is possible. The structural adaptation of the pattern environment in a continuous
conformational space will be driven by the selected optimization method, typically a
gradient method, in combination with the selected energy function, typically a potential
or free energy function. Here, the accessible space is in principle nearly infinite since it
includes all possible combinations of continuous torsion variations with bond angle
bending, bond stretching, and so on. However, in practice, most continuous
optimization methods (such as e.g. steepest descent energy minimization) do cause only
local changes in the structure and are prone to get trapped into local optima instead of
the global optimum. On the other hand, gradual non-rotameric optimization methods
advantageously provide a simple solution to the problem of main-chain flexibility
because during structural optimization of a pattern environment, the protein main-chain
can be optimized together with the side chains.

In conclusion, both rotameric and non-rotameric methods are suitable for both
pattern definition and adaptation of the pattern environment. Although all combinations
are possible and useful in particular circumstances, particularly preferred are the
combination of rotameric patterns with non-rotameric adaptation as well as
combinations of rotameric adaptation with rotameric or non-rotameric patterns of
interest.

ENERGY FUNCTION 

The most important element of an ECO is its energy value, which should be
interpretable as a quantitative measure of the structural compatibility of the ECO pattern
within the context of the adapted protein structure. For this reason, the notion of
"structural compatibility" may be considered equivalent to "energetic compatibility".

In principle, the method of the present invention may be performed with all energy
functions commonly used in the field of protein structure prediction, provided that the
applied energy model allows one to assign an energetic value that is a pure function of the
atomic coordinates, atom types and chemical bonds in a protein structure, or any part
thereof. When considering rotamers it is preferred, although not required, that the
energy function be pair-wise decomposable, i.e. the total energy of a protein structure
can be written as a sum of single and pair-residue contributions (see equation 4 below),
in order to facilitate the adaptation step in the generation of ECO's.
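
The notion of a pair-wise decomposable energy function may be illustrated by the following minimal Python sketch, in which the total energy of a rotamer assignment is obtained from pre-calculated single-residue and residue-pair tables. The function name `total_energy` and the dictionary layouts are hypothetical, and the sketch does not purport to reproduce the exact form of the equation referred to above.

```python
from itertools import combinations


def total_energy(assignment, self_energy, pair_energy):
    """Sum single-residue and residue-pair terms for one rotamer assignment.

    assignment:  dict position -> rotamer id
    self_energy: dict (position, rotamer) -> energy with the fixed template
    pair_energy: dict (pos_i, rot_i, pos_j, rot_j) -> interaction energy,
                 stored once per pair with pos_i < pos_j; missing pairs count as 0
    """
    energy = sum(self_energy[(pos, rot)] for pos, rot in assignment.items())
    for pos_i, pos_j in combinations(sorted(assignment), 2):
        key = (pos_i, assignment[pos_i], pos_j, assignment[pos_j])
        energy += pair_energy.get(key, 0.0)
    return energy


# Two positions, one rotamer each: -1.0 + 0.5 (self terms) - 2.0 (pair term).
assignment = {1: 0, 2: 1}
self_energy = {(1, 0): -1.0, (2, 1): 0.5}
pair_energy = {(1, 0, 2, 1): -2.0}
print(total_energy(assignment, self_energy, pair_energy))
```

Because every term depends on at most two residues, both tables can be computed once and reused for every combination of rotamers examined during the adaptation step.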

Preferably, the applied energy model should approximate as closely as possible the
true thermodynamic free energy of a molecular structure. Therefore, the applied energy
function should include as many relevant energetic contributions as possible, depending
on the accuracy requirements of the user. Highly preferred energy functions include
terms which account for at least Van der Waals interactions (including a Lennard-Jones
potential such as disclosed by Mayo et al. in J. Prot. Chem. (1990)), hydrogen bond
formation (such as referred to by Stickle et al. in J. Mol. Biol. (1992) 226:1143),
electrostatic interactions and contributions related to chemical bonds (bond stretching,
angle bending, torsions, planarity deviations), i.e. the conventional potential energy
terms. Useful energy models including such contributions are, for example, the
CHARMM force field of Brooks et al. (cited supra) or the AMBER force field
disclosed by Weiner et al. in J. Am. Chem. Soc. (1984) 106:765.

In addition to these potential energy terms, some other energetic contributions may
be [...], given the fact that different ECO's may involve different residue types
[...] chemical nature, placed at [...] positions in a
protein structure. Four types of energetic contributions and corrections may account for
type- and topology-dependent effects: corrections for (1) main-chain flexibility, (2)
[...] effects, (3) intraside-chain contributions and (4) the behavior of particular
[...]

[...] be corrected by reducing the sensitivity of the terms related to packing effects, e.g.
using methods including united-atom parameters such as disclosed by Lazar et al. in
[...] Sci. (1997) 6:1167-1178 or re-scaling of atomic radii such as proposed by
[...] (1997) 94:[...]-10177. When generating
ECO's using a gradient method in the adaptation step, main-chain variability [...]
taken into account by including corresponding atoms in the [...]

[...] hydrophobic [...] be [...]
determining the reliability of modeled protein structures. According to the present
[...] accuracy is a method wherein the burial and exposure of polar and non-polar
atoms is translated into energetic [...]
buried and exposed surface areas and of a set of atom type [...]
parameters such as reviewed by Gordon et al. in Curr. Opin. [...]
generation occurs in a rotameric way, it is beneficial to have a pair-wise [...]
solvent function, such as proposed by Street et al. (cited supra). Also preferred for [...]
terms ([...] acids) depending on the degree of solvent [...] of the respective rotamers at the
considered residue positions in the protein structure of interest. This [...]
referred to as the type-dependent, topology-specific solvation (TTS) [...]
method, a set of N_TYPES energetic parameters is established for each of N_CLASSES
different classes of solvent exposure, where N_TYPES is the number of residue types
considered (usually 20) and N_CLASSES is the number of discrete classes of solvent
exposure. For example, three such classes may be established, corresponding with
buried, semi-buried and solvent-exposed rotamers, respectively. It is noteworthy that
different rotamers for the same residue type at the same position may be assigned
different classes. Both the number of classes and their boundaries may be subject to
variation, but the following embodiment should be preferred:

- first, each pattern element (a given residue type at a given residue position in a
specific rotameric state) is substituted into the protein structure and its accessible
surface area (ASA) is calculated, for instance according to Chothia in Nature (1974)
248:338-339,

- next, a class assignment occurs on the basis of the percentage ASA of the residue
side chain in the protein structure compared to the maximal ASA (over all rotamers)
of the same side chain being shielded from the solvent only by its own main-chain
atoms, for instance the class of buried rotamers may be defined by ASA% < 5%, the
class of semi-buried rotamers by 5% < ASA% < 25% and the class of solvent-exposed
rotamers by 25% < ASA%,

- and finally an appropriate type- and topology-specific energy, $E_{TTS}(T,C)$, for the
considered pattern element is retrieved from a table of TTS values, using the pattern
type (T) and class (C) indices.

The TTS energy thus obtained should be added as an extra term to the energy of the
pattern element in the protein structure (see FORMAL DESCRIPTION OF ECO'S).
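
The three steps above (ASA calculation, class assignment and table look-up) may be sketched as follows, for illustration only. The function names, the class labels and the TTS table entry are hypothetical placeholders and are not values from table 3; the ASA values are assumed to have been computed beforehand. Since the text leaves the treatment of the exact boundary values open, the sketch assigns 5% to the semi-buried class and 25% to the solvent-exposed class, which is an assumption.

```python
def exposure_class(asa, max_asa):
    """Assign a solvent exposure class from the side-chain ASA percentage.

    asa:     accessible surface area of the side chain in the protein structure
    max_asa: maximal ASA of the same side chain (over all rotamers) when it is
             shielded from the solvent only by its own main-chain atoms
    """
    pct = 100.0 * asa / max_asa
    if pct < 5.0:
        return "buried"
    if pct < 25.0:          # boundary handling is an assumption of this sketch
        return "semi-buried"
    return "exposed"


def energy_with_tts(e_pot, residue_type, asa, max_asa, tts_table):
    """Add the type- and topology-specific term E_TTS(T, C) to a potential energy."""
    return e_pot + tts_table[(residue_type, exposure_class(asa, max_asa))]


# Placeholder table with a single, made-up entry (not a value from table 3).
tts_table = {("LEU", "buried"): -1.5}
print(energy_with_tts(-3.0, "LEU", 2.0, 100.0, tts_table))
```

Different rotamers of the same residue type at the same position may yield different ASA percentages and hence retrieve different table entries.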

A table of TTS values may be established by tuning the different TTS parameters
$E_{TTS}(T,C)$ versus a heuristic scoring function S as follows. The tuning method includes:

- first considering a set of a number of, preferably at least 50, well-resolved protein
structures and assigning to each residue position of this set a unique index $i$ (1 ≤ $i$ ≤
all residues),

- then considering, at each residue position $i$, all residue types $a$ in all rotameric states
$r$, thereby forming the set of all possible position-type-rotamer combinations $i_r^a$,

- defining by $i_r^w$ the WT residue type at position $i$ in a rotameric state $r$ as observed in
the protein comprising position $i$,

- assigning to each $i_r^a$ a solvent exposure class $C(i_r^a)$ as described above,

- calculating for each $i_r^a$ the value $E_{pot}(i_r^a)$ as the total potential energy the rotamer $i_r^a$
experiences within the fixed context of the protein structure comprising $i$, preferably
using conventional potential energy terms such as disclosed above,

- defining a function $\Delta E(i)$ as

$$\Delta E(i) = \min_r \left( E_{pot}(i_r^w) + E_{TTS}(w, C(i_r^w)) \right) - \min_a \min_r \left( E_{pot}(i_r^a) + E_{TTS}(a, C(i_r^a)) \right) \qquad \text{(eq. 2)}$$

$\Delta E(i)$ being interpretable as the difference in energy of the WT residue type in
its best possible rotamer (as obtained by the $\min_r$ operator) and the
energetically most favorable residue type at residue position $i$ also in its best
possible rotamer (as obtained by both the $\min_a$ operator and the $\min_r$ operator),
given a (as yet undefined) set of TTS parameters, and

- applying a suitable parameter optimization method to maximize S by varying
[...]

[...] are zero, then equation (2) provides the difference in
potential energy between the optimal WT rotamer and the energetically most favorable
[...] at position $i$ in its optimal rotameric state, within the context of the protein
structure involved. The effect of assigning non-zero values to the TTS parameters is that
the energy of a residue will be modified by its associated type- and topology-dependent
[...]
the observed WT residue becomes or approximates the energetically optimal residue
type. In order to reach this objective, the objective scoring function S defined by
following equation (3)

$$S = \sum_i \left( 1 + \Delta E(i) \right)^{-1} \qquad \text{(eq. 3)}$$

may be used, which means that if at position $i$ the WT residue type is [...]
in energy than the best possible type, this would result in an increase of S by 0.5 unit.
[...], it has been observed that the function S as defined by equation (3) [...]
and below the current value of each parameter and accepting only up-hill [...]
better values) in S. Initial parameters may be set to zero and initial steps are preferably
[...] mol$^{-1}$, gradually reducing in size by about 75% per optimization [...]
until a step size of about 0.05 kcal mol$^{-1}$ is reached, which terminates the parameter
optimization.

The same strategy of parameter tuning may lead to even more relevant results if the
potential energy calculations, $E_{pot}(i_r^a)$, are performed [...]
protein structure rather than a fixed structure. This can be accomplished e.g.
by using one of the structure optimization methods described in the next section.
Moreover, such approach is fully consistent with the ECO concept itself, wherein the
compatibility of patterns is evaluated in an adaptive environment. A set of TTS
contributions derived in this way is shown in table 3.
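
The tuning strategy may be sketched as follows, for illustration only. The sketch takes ΔE(i) as the TTS-corrected energy of the WT type in its best rotamer minus that of the best type in its best rotamer, and S as the sum over positions of 1/(1 + ΔE(i)), consistent with the statement that a ΔE of 1 kcal mol⁻¹ contributes 0.5 unit. All function names and data layouts are hypothetical; the initial step of 1.0 is an assumed value, whereas the shrink factor of 0.25 (a reduction of about 75% per round) and the final step of about 0.05 follow the text.

```python
def delta_e(site, tts):
    """Energy gap between the WT type and the best type at one position.

    site: (wt_type, {residue_type: [(E_pot, exposure_class), ...]}),
          one tuple per rotamer of that type at this position.
    tts:  dict (residue_type, exposure_class) -> TTS parameter (default 0).
    """
    wt, rotamers = site
    best = {a: min(e + tts.get((a, c), 0.0) for e, c in rs)
            for a, rs in rotamers.items()}
    return best[wt] - min(best.values())   # never negative: WT is among the types


def score(sites, tts):
    """Scoring function S: each position contributes 1 / (1 + delta_e)."""
    return sum(1.0 / (1.0 + delta_e(s, tts)) for s in sites)


def tune_tts(sites, keys, step=1.0, final_step=0.05, shrink=0.25):
    """Hill-climb each parameter up and down, keeping only improvements in S."""
    tts = {k: 0.0 for k in keys}            # initial parameters set to zero
    best = score(sites, tts)
    while step >= final_step:
        for k in keys:
            for trial in (tts[k] + step, tts[k] - step):
                old = tts[k]
                tts[k] = trial
                s = score(sites, tts)
                if s > best:
                    best = s                # accept the up-hill step
                else:
                    tts[k] = old            # reject and restore
        step *= shrink                      # about 75% smaller each round
    return tts, best


# Toy data: WT leucine is 2 energy units worse than arginine before tuning.
sites = [("LEU", {"LEU": [(0.0, "buried")], "ARG": [(-2.0, "buried")]})]
keys = [("LEU", "buried"), ("ARG", "buried")]
tts, best = tune_tts(sites, keys)
print(tts, best)
```

In a realistic setting the list of sites would cover all residue positions of the set of well-resolved structures, and the keys would enumerate all N_TYPES × N_CLASSES parameters.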

A most attractive feature of the TTS method is that terms used therein will
automatically account for two other effects which may be critical in the field of protein
design. The first effect is the mutual bias of different residue types when comparing
their intraside-chain energies. For example, an arginine residue may have an energetic
advantage over a chemically similar lysine residue by more than 5 kcal mol$^{-1}$ (using the
CHARMM force field) which is solely due to intraside-chain contributions. In order to
correct for this phenomenon in the prior art, such contributions may be simply switched
off or corrected for by subtracting a reference energy calculated as the potential energy
of an isolated amino acid residue, according to Wodak (cited supra). However, the
present invention prefers not to impose any a priori intraside-chain correction, but to
incorporate this into the TTS contributions. The second effect relates to energetic
aspects of the unfolded state, which is ignored when computing energies exclusively
from the folded state of the protein. This involves complicated elements like changes in
entropy, residual secondary and tertiary structure in the unfolded state, the existence of
transient local and global states, and so on. Consequently, in a modeling environment
where native, folded structures are examined, it is extremely difficult to predict the
impact of such effects on the predictions. However, in as far as such effects would be
type- and topology-dependent, it may be assumed that the TTS method implicitly takes
them into account. For example, buried leucine residues may experience a greater loss
in torsional entropy compared to isoleucine upon folding of the protein in view of the
fact that the latter is beta-branched.

In conclusion, the present method for generating ECO's requires a suitable energy
function including as many relevant contributions as possible, preferably functions
based on inter-atomic distances and bond properties. There is no specific preference
with respect to explicit treatment of hydrogen atoms or not. The implementation of the
aforementioned TTS method will be useful when ECO's are prepared for fold
recognition or protein design, as well as for other applications.

STRUCTURE OPTIMIZATION

[...] i.e. a search method producing an optimal or nearly optimal 3D structure for
[...], given the starting structure of the protein as previously described in
STRUCTURE DEFINITION, the introduction of one or more substitutions according to
[...] interest (except for the reference structure described in the section
REFERENCE STRUCTURE below), the allowed conformational variation of the
[...] as previously described in CONFORMATIONAL FREEDOM and the
[...]

[...] are introduced in the context of a protein structure, i.e. an ECO energy is [...] a
differential value between the total energy of a substituted protein structure and that of a
[...] "reference" protein structure. Both the substituted and the reference structures
[...] ECO energy as a measure of the true substitution energy associated with the pattern of
interest. Therefore the method should guarantee that both the substituted and the
non-substituted protein structures approximate as closely as possible the structures as may be
[...]

[...] backbone [...] may be a source of error for
the [...]: proteins comprising [...]
their backbone conformation or may even completely unfold. [...]
but should not change the conclusion that such substitutions are probably incompatible
with the protein backbone.

The present invention has no preference with [...]
(or search) method used as long as it provides a beneficial ratio [...]
accuracy versus computational speed. Suitable known search methods include, for
example, Monte Carlo simulation such as disclosed by Holm et al. in [...]
(1992) 14:213-223, genetic algorithms such as disclosed by Tuffery et al. in [...],
mean-field approaches such as [...]
Koehl et al. in J. Mol. Biol. (1994) 239:249-275, restricted combinatorial analysis
such as disclosed by Dunbrack et al. in [...] (1993) 230:543-574, basic
molecular mechanics gradient methods such as described in Stoer et al.
in Introduction to Numerical Analysis (1980), Springer Verlag, New York and DEE
methods such as previously disclosed. In the preferred embodiment of the present
invention wherein the conformational freedom of individual residues is represented
by rotamers, a yet unknown structure optimization method is however preferred
over the previously known methods because it combines a high accuracy
with a high intrinsic computational efficiency. This combination of characteristics is
important especially in the systematical pattern analysis (SPA) embodiment of this
invention wherein typically hundreds of thousands of patterns per protein need to
be analyzed. This method, hereinafter referred to as the FASTER method, is a method
for generating information related to the molecular structure of a biomolecule (e.g.
a protein), the method being executable by a computer under the control of a
program stored in the computer, and comprising the steps of:

(a) receiving a 3D representation of the molecular structure of said biomolecule, 
the said representation comprising a first set of residue portions and a template; 

(b) modifying the representation of step (a) by at least one optimization cycle; 
characterized in that each optimization cycle comprises the steps of: 

(b1) perturbing a first representation of the molecular structure by modifying
the structure of one or more of the first set of residue portions;
(b2) relaxing the perturbed representation by optimizing the structure of one 
or more of the non-perturbed residue portions of the first set with respect 
to the one or more perturbed residue portions; 
(b3) evaluating the perturbed and relaxed representation of the 
molecular structure by using an energetic cost function and 
replacing the first representation by the perturbed and relaxed 
representation if the latter's global energy is more optimal than that 
of the first representation; and 

(c) terminating the optimization process according to step (b) when a 
predetermined termination criterion is reached; and 

(d) outputting to a storage medium or to a consecutive method a data structure 
comprising information extracted from step (b). 
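
The perturb, relax and evaluate cycle of steps (b1) to (b3) may be illustrated by the following minimal Python sketch. It is not the FASTER method itself but a simplified greedy variant written for illustration only. The names total_energy, relax and optimize, the representation of a structure as one rotamer index per residue, and the tables self_e and pair_e are all assumptions introduced here, as is the choice of a pair-wise summable energy.

```python
from itertools import combinations


def total_energy(state, self_e, pair_e):
    """Pair-wise summable energy of a rotamer assignment (one index per residue)."""
    energy = sum(self_e[i][r] for i, r in enumerate(state))
    for i, j in combinations(range(len(state)), 2):
        energy += pair_e[(i, j)][state[i]][state[j]]
    return energy


def relax(trial, perturbed, self_e, pair_e):
    """(b2) Re-optimize each non-perturbed residue in turn, the others held fixed."""
    for j in range(len(trial)):
        if j == perturbed:
            continue

        def energy_with(rot, j=j):
            candidate = trial[:j] + [rot] + trial[j + 1:]
            return total_energy(candidate, self_e, pair_e)

        trial[j] = min(range(len(self_e[j])), key=energy_with)
    return trial


def optimize(start, self_e, pair_e, max_cycles=100):
    """Hypothetical driver: repeat perturb/relax/evaluate until no move improves."""
    state = list(start)
    best = total_energy(state, self_e, pair_e)
    for _ in range(max_cycles):                # step (b): optimization cycles
        improved = False
        for i in range(len(state)):
            for rot in range(len(self_e[i])):
                if rot == state[i]:
                    continue
                trial = list(state)
                trial[i] = rot                 # (b1) perturb residue i
                trial = relax(trial, i, self_e, pair_e)
                energy = total_energy(trial, self_e, pair_e)
                if energy < best:              # (b3) keep only improvements
                    state, best, improved = trial, energy, True
        if not improved:                       # (c) termination criterion
            break
    return state, best                         # (d) output
```

In this sketch the termination criterion of step (c) is simply "no accepted move during a full cycle"; any other predetermined criterion could be substituted.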

In short, this method works by repeated, gradual optimization of rotameric sets of 
residues. When the protein system is near its globally optimal structure, which is the
[...] (e.g. one or two residues) are substituted into a reference
structure, only a small number of [...]
relaxed structure. This process typically [...]
seconds CPU time per pattern ([...] MHz R12000 processor as a
reference). The computation of the reference structure [...] about 100
to 1000 CPU-seconds per protein structure.

REFERENCE STRUCTURE

The reference structure is generated starting [...] for the protein
of interest, given an experimentally determined [...],
a particular rotamer library (only in the [...] embodiment of the present
invention) and a selected energy function, and may be [...] by [...] search
method (see STRUCTURE OPTIMIZATION).

The role of the reference structure is to [...]
in step [...] interpretable as a substitution [...]. For
instance, the protein of interest may [...]
lowest energy by supplying the [...]
method may define a reference sequence [...] from the observed [...]
in the starting structure of the [...]
complete amino acid sequence may be [...]
conformation of lowest energy [...]
structures that [...] be generated in an ECO [...]
the structure of lowest energy for the sequence of said enzyme (e.g. using the
homologous backbone structure) rather than for the sequence of the homologous protein
(using its own backbone structure). If the mutants of the said enzyme were built from
ECO's generated for the homologous structure, this would involve many more
individual changes (and potential sources of error) as compared to when the enzyme
mutants are built with ECO's generated with the modeled enzyme sequence as a
reference structure.

However, protein design and fold recognition applications may also rely on some
general purpose databases of ECO's derived from representative protein structures or
scaffolds. Furthermore, it is not an absolute requirement to use any known or existing
reference amino acid sequence when generating ECO's. A most interesting type of
reference sequence may be the sequence leading to the lowest possible energy for the
protein of interest, in other words the global minimum energy sequence which may be
computed by any suitable search method such as previously described (see
STRUCTURE OPTIMIZATION), e.g. a DEE method or the FASTER method. Some
other reference sequences, such as a polyglycine or a polyalanine sequence, may
provide practically useful reference structures. They provide the advantage that the
adaptation step in the ECO computation is easier, but the disadvantage that the ECO
patterns are modeled onto an almost naked backbone. Finally, we observed
experimentally that assessing multiply substituted proteins from single and double
ECO's is surprisingly insensitive to the nature of the reference sequence, especially in
applications which tolerate some level of approximation (like fold recognition).
Therefore any possible reference sequence, including randomly generated and dummy
sequences, may be practically useful in particular cases. Selecting a reference sequence
forms an essential part of generating an ECO since it directly relates to the interpretation
of the associated energy value as being a substitution energy.



DESCRIPTION OF THE ECO CONCEPT

An "energetic compatibility object" (ECO) is an object which basically provides a
fully structured description of the results of a modeling experiment wherein a given
pattern p of interest has been introduced and modeled into a protein reference structure.
Usually the pattern p consists of a small number of amino acid residue types, each being
placed at a specific residue position in the reference structure and adopting a
well-defined fixed conformation. In a preferred embodiment of the present invention,
hereinafter referred to as the "systematical pattern analysis" (SPA) mode, a large number
of predefined patterns may be generated and analyzed within the context of the
reference structure. The SPA mode is particularly suited to obtain [...]
concerning the mutability of single and pairs of residues. It also provides the necessary
[...] for the analysis of multiply substituted proteins on the basis of elementary
[...].

The number of residues of p is usually 1 or 2, [...] larger patterns of interest
comprising e.g. a set of clustered core residues may be considered as well. In general,
the present invention does not impose restrictions [...]
since by definition p is a pattern of interest to the user. However, in the preferred SPA
mode, where ECO's are systematically generated for many different patterns, the
number of residues of any such pattern may preferably be limited to 1 or 2.

Besides the number of residues, an unambiguous description of p also requires
[...] the exact locations of the one or more residues of p within the framework
of the reference structure. In the present invention the allowed locations correspond
to amino acid residue positions, which means that the said locations are [...]
selecting, for each pattern residue, a specific amino acid residue position of interest. All
[...] positions, irrespective of whether they are chemically modified or cross-linked to
other residues [...]. Conversely, atoms or molecules not belonging to a protein or peptide,
[...] in the art as ligands, are not considered as valid positions. In the case of pattern
[...]
appreciated by those in the art, it may be desirable [...]
residue positions are proximate in space since very distant residues are
independent. However this is preferred only when (i) a large number of different
patterns has to be examined and computational speed is a limiting [...]
condition of independence and the related criterion of proximity have been [...].
In the preferred SPA mode of implementation of the invention, the residue
positions may [...] be determined by the region in the protein reference structure
[...] is of interest to the user. If the interest is e.g. in [...] a protein, [...]
residues may be taken into account and a set of patterns may be systematically defined
for all single core residues as well as all possible pair-wise combinations thereof.

A further aspect of the present invention is the chemical nature of the one or more
residues of p which may be validly considered at the selected positions in the reference
structure. According to the invention, all naturally-occurring and synthetic amino acid
types may be included in the set of valid residue types. The choice of these residue
types is basically unrelated to the amino acid sequence of the reference structure. If a
residue type of p differs from the residue type in the reference structure at the selected
position, this actually corresponds to a mutation. In such case, the original residue type
needs to be removed from the reference structure and replaced by the pattern residue
type. Each replacement step is preferably done by respecting the rules of standard
geometry (standard bond lengths and angles and default torsion angles). In the preferred
SPA mode of implementation of the invention, the set of single residue patterns may
consist of all valid residue types as defined above, placed at each residue position of the
region of interest. The set of pair residue patterns may consist of all possible pair-wise
combinations thereof.

Fully consistent with the present invention is any method used in step B in order to
further restrict the number of valid residue types at each position in the region of
interest. Such an embodiment may be used (1) when the available computational
resources necessitate a restriction, or (2) in cases which are interrelated with the
application of ECO's. Ignoring the first case (thus assuming infinite computational
resources), it will be clear to those skilled in the art that some residue types may be
discarded from consideration depending on the position in the reference structure.
Examples of such restriction methods include, without limitation:

- the exclusion of synthetic residue types,

- the exclusion of proline (actually an imino acid) residues depending on the local
main-chain conformation,

- the restriction of core residues to non-polar types,

- the exclusion of residue types with low helical propensity in alpha-helices,

- the preservation of the chemical nature (e.g. aromatic) of residue types in the
reference structure, and so on.
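
By way of illustration only, two of the restriction methods listed above may be sketched in Python as simple set operations. The function name allowed_types, its parameters, and the membership of the sets NONPOLAR and LOW_HELIX_PROPENSITY are assumptions made for this example; the present description does not prescribe any particular classification of residue types.

```python
# Illustrative, assumed classifications (not taken from the description).
NONPOLAR = {"Ala", "Val", "Leu", "Ile", "Met", "Phe", "Trp"}
LOW_HELIX_PROPENSITY = {"Gly", "Pro"}


def allowed_types(candidates, is_core, in_helix):
    """Restrict the candidate residue types at one position of interest."""
    allowed = set(candidates)
    if is_core:
        # restriction of core residues to non-polar types
        allowed &= NONPOLAR
    if in_helix:
        # exclusion of types with low helical propensity in alpha-helices
        allowed -= LOW_HELIX_PROPENSITY
    return sorted(allowed)
```

For a buried helical position, allowed_types({"Ala", "Ser", "Gly", "Leu"}, is_core=True, in_helix=True) keeps only Ala and Leu under these assumed sets.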

So far, a pattern p is defined as one or more residue types placed into the reference
structure at well defined residue positions. In order to unambiguously define p, the
conformation of each p-residue must be further specified by the assignment of one
specific conformation to each residue of p, preferably by selecting a specific rotamer
from a rotamer library as described hereinabove. Alternatively, in the non-rotameric
embodiment of the present invention, the said conformational assignment may result
from other sources including, but not limited to, experimentally determined protein
structures, computer-generated conformations or, as described below, a grouping
process applied to different ECO's. The present invention does not impose restrictions
on the nature and origin of the conformation of the p-residues. In the preferred SPA
mode of implementation of the present invention, all (combinations of) [...]
applied to the p-residues. For patterns consisting of residue pairs, this leads to a
six-dimensional space. In order to keep the "volume" of this hyper-space within reasonable
limits, a variety of simple and more complicated criteria may be established in order to
reduce the number of patterns for residue pairs. However, such criteria are [...]
the present invention since the results of any reduction method necessarily lead to a
[...] of patterns of interest, which is the subject of the present invention.

The most important property of an ECO is its energy value, which is defined as the
difference between the global energy of the reference structure and the global energy of
the same reference structure after its modification by the steps of (1) introducing the
pattern p into the reference structure and (2) computing the energetically [...]
conformational rearrangement of the reference structure as illustrated in [...]. Steps
(1) and (2) will be referred to as the pattern introduction (PI) step and the structural
adaptation (SA) step, respectively. Computing the ECO energy value is therefore
[...]
global energy of the substituted and adapted structure is calculated and further wherein
the global energy of the reference structure is subtracted from this value.

The PI step involves the necessary substitutions and the generation of
[...] of the p-residues, as dictated by the precise definition of the pattern p.
This is preferably accomplished by a conventional protein modeling package such
[...] by Delhaise et al. (cited supra) comprising practical tools in order to perform
amino acid substitutions and conformational manipulations. Once the pattern p
[...] introduced, the amino acid type(s) and conformation(s) of the substituted residues
[...] fixed throughout the process leading to the ECO energy value computation. The PI
step can be seen as a specific perturbation of the reference structure, leaving the
structure in a sub-optimal [...]. Indeed, the introduced residues in their fixed
conformation may very well be in steric conflict with surrounding amino acids.

In the SA step, the latter are given a chance to rearrange and find a new relaxed
conformation in which steric constraints are alleviated and intramolecular interactions
are improved. According to the present invention, this rearrangement is energy-based,
i.e. the cost function to be optimized during the SA step is the energy of the global
structure as defined in the section ENERGY FUNCTION. In a preferred embodiment of
the invention, especially the rotameric embodiment described in the section
CONFORMATIONAL FREEDOM, all atoms of the pattern residues and the main-chain
of the reference structure are held fixed throughout the SA step. In a less preferred
embodiment, the pattern atoms remain fixed but the main-chain atoms may slightly (i.e.
by about 1 to 2 Å) change in position during the SA step. In another embodiment, and
only for the non-rotameric approach, also the pattern atoms may change in position
during the adaptation step; this embodiment is least preferred since the introduced
pattern then loses its original structure and accounts, at least partially, for the structural
relaxation effects.

The SA step itself may be performed by any method suitable for protein structure
prediction (see STRUCTURE OPTIMIZATION) on the basis of an objective energetic
scoring function (see ENERGY FUNCTION). The goal of the SA step is to find the
energetically best possible substituted and adapted structure. In this step, the substituted
reference structure is submitted to a structural optimization process exploring the
conformational space of the protein system, taking into account (i) the restraints
imposed onto the backbone and pattern atoms (which are preferably fixed), (ii) the
allowed conformational freedom (e.g. as determined by a rotamer library) and (iii) the
applied energy function. The specific procedure used in order to perform the said
structural optimization is not critical to the present invention. Methods which may be
used for this purpose are described in the section STRUCTURE OPTIMIZATION.

In the EC step, the ECO energy value is calculated as a differential value, i.e. the
energy of the substituted and adapted structure obtained from the SA step, minus the
energy of the reference structure. Both energies can be readily obtained by a standard
energy calculation method using the relevant energy function in order to compute all
relevant intra-molecular interactions and providing a quantitative estimate for the free or
potential energy of a protein structure. Therefore the ECO energy value obtained by
subtracting both energies is a quantitative measure for the energetic cost to mutate the
pattern p into the reference structure. As far as the computed energies correspond
with the true thermodynamic free energies of the original (WT) [...]
reflect the change in thermodynamic stability of the mutated protein [...],
the ECO energy value can also be considered as a fair estimate of the
[...] of the protein of interest. [...]
energy value can be either positive (in which case the p-residues are expected
to destabilize the protein) or negative, in which case
the corresponding pattern is expected to stabilize the protein molecule.
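
The succession of the PI, SA and EC steps may be sketched as follows on a toy system. This is a minimal illustration under stated assumptions, not the implementation of the invention: the SA step is performed here by exhaustive enumeration (any search method of the section STRUCTURE OPTIMIZATION could be used instead), each residue is described by an integer state index, and the names energy, eco_energy, self_e, pair_e and n_states are hypothetical.

```python
from itertools import product


def energy(state, self_e, pair_e):
    """Global energy: self terms plus pair terms (missing pairs count as 0)."""
    n = len(state)
    total = sum(self_e[i][state[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            total += pair_e.get((i, j, state[i], state[j]), 0)
    return total


def eco_energy(ref_state, pattern, n_states, self_e, pair_e):
    """pattern maps a residue position to the fixed state it must adopt."""
    e_ref = energy(ref_state, self_e, pair_e)
    free = [i for i in range(len(ref_state)) if i not in pattern]
    best = None
    # SA step: explore the states of all non-pattern residues exhaustively.
    for combo in product(*(range(n_states[i]) for i in free)):
        state = list(ref_state)
        for i, s in pattern.items():       # PI step: pattern residues are fixed
            state[i] = s
        for i, s in zip(free, combo):
            state[i] = s
        e = energy(state, self_e, pair_e)
        if best is None or e < best:
            best = e
    return best - e_ref                    # EC step: differential value
```

In this toy representation a positive return value corresponds to an energetic cost for introducing the pattern, a negative value to a gain.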

Apart from the ECO pattern and the associated energy value, an ECO [...]
include other relevant information, including data referring to the origin of the ECO
such as (i) a reference (e.g. a PDB-code) to the (e.g. experimentally determined) 3D
[...] generation, a reference to the rotamer
[...] latter data are important for a correct interpretation of the ECO energy. [...]
[...] to restore or re-transform these data (e.g. to express the data in
normalized units) if needed in a particular application.

[...] easy to compare sets of ECO's [...]
may be stored in an ECO object. [...]
but are [...] to (i) a description of the observed secondary structure for
[...] for some or all atoms of the pattern residues,
[...] the Cα atoms, which may be helpful in identifying
[...] distant residues [...]
solvent accessibility (e.g. the accessible surface area) of each p-residue. While not
essential for the assessment of the ECO energy value, the storage of such additional data
may be valuable when using ECO's for specialized applications like high-throughput
fold recognition or protein design. A more extensive list of additional ECO elements is
provided in the section FORMAL DESCRIPTION OF ECO'S.

For the sake of compactness, data elements in an ECO object which refer to the
protein structure, residue types, rotameric states, and so on are preferably encoded as
indices pointing to data present in external lists, instead of explicitly including the raw
data. Also, ECO's sharing common properties (e.g. ECO's generated for the same
protein structure, using the same rotamer library and energy function) may be grouped
into a data structure (e.g. a file, an array or a database), the shared properties being
assigned to the data structure rather than included in each ECO object. However, this
kind of data storage and encoding is purely a mode of implementation, not an essential
aspect of the present invention.
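
One possible encoding along these lines is sketched below in Python. It is only an illustration of the index-based storage described above; the class names Eco and EcoSet, their field names and the describe helper are assumptions introduced for this example and do not reflect any prescribed file or database layout.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Eco:
    """One ECO, storing indices rather than raw data."""
    positions: tuple      # residue positions of the pattern
    type_idx: tuple       # indices into the shared residue type list
    rotamer_idx: tuple    # indices into the shared rotamer library
    energy: float         # ECO energy value


@dataclass
class EcoSet:
    """Container holding the properties shared by all of its ECO's."""
    structure_ref: str        # e.g. an identifier of the protein structure
    rotamer_library: str      # identifier of the rotamer library used
    energy_function: str      # identifier of the energy function used
    residue_types: list       # external list that type_idx points into
    ecos: list = field(default_factory=list)

    def describe(self, eco):
        """Decode an ECO into (position, residue type name, rotamer index)."""
        return [
            (pos, self.residue_types[t], rot)
            for pos, t, rot in zip(eco.positions, eco.type_idx, eco.rotamer_idx)
        ]
```

Because the shared properties live on the container, each Eco instance carries only a few integers and one energy value.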
Many further transformations of the native ECO energy values obtained from the
EC step may be useful within the context of specific applications of the present
invention. Examples of such transformations include, without limitation, (i) the
expression of ECO energies relative to the lowest value of all ECO's known for
different patterns located at the same residue positions, so that each transformed ECO
energy becomes non-negative, (ii) transforming ECO energies by means of a linear,
logarithmic or exponential function, (iii) truncating every transformed or
non-transformed ECO energy above a given threshold value to that threshold value, (iv) the
expression of ECO's in normalized units, and so on.
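
Transformations (i) and (iii) may, for instance, be sketched as follows. The function names relative_to_lowest and truncate, and the use of a dictionary mapping a pattern label to its ECO energy, are assumptions made for illustration.

```python
def relative_to_lowest(energies):
    """(i) Express energies relative to the lowest value at the same positions."""
    lowest = min(energies.values())
    return {pattern: value - lowest for pattern, value in energies.items()}


def truncate(energies, threshold):
    """(iii) Cap every energy above the threshold at the threshold value."""
    return {pattern: min(value, threshold) for pattern, value in energies.items()}
```

The two functions may be chained, e.g. truncate(relative_to_lowest(energies), threshold), since each returns a new dictionary with the same keys.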

Another type of transformation, i.e. the grouping of native ECO's into grouped
ECO's (referred to as GECO's), does form part of the present invention. This grouping is
applicable to the SPA mode wherein a multitude of ECO's are generated for identical
positions in a protein. In a preferred embodiment of this invention, ECO's sharing
identical positions in the protein of interest may be grouped into GECO's in order to (i)
reduce the amount of information and (ii) extract the most significant information. The
importance of ECO grouping as well as some preferred methods to generate GECO's is
discussed in the section GROUPED ECO'S.

FORMAL DESCRIPTION OF ECO'S

The following notations are used (see also LIST OF SYMBOLIC NOTATIONS).

• $N$: the set of residue positions of interest within the protein of interest. Example 1: $N$ =
{[...] amino acid [...] of the protein}. Example 2: $N$ = {[...] amino
acid residues of the protein}.

• $i$: a specific residue position of interest, where $i \in N$. Example: $i = 1$ denotes the
first residue position of $N$.

• $A(i)$: the set of residue types of interest at position $i$. In the SPA mode of the present
invention, this set may contain multiple residue types. Example 1: $A(1)$ = {[...]
natural amino acid residue types}. Example 2: $A(1)$ = {Gly, Ala, Ser}.

• $i^a$: a specific residue type of interest $a$ at position $i$, where $a \in A(i)$. Example: $1^2$
denotes the second residue type of $A(1)$, at the first residue position.

• $R(i^a)$: the set of rotamers of interest for residue type $a$ at position $i$. Rotamers
[...] retrieved from either a backbone-dependent or backbone-independent library.
Example 1: $R(i^a)$ = {the set of rotamers known for residue type $a$ in the
[...] which are [...]}. Example 2: the set of
rotamers $R(i^a)$ where $a$ refers to a Gly residue type [...]
one dummy element since Gly has no rotatable side chain.

• $i_r^a$: a specific rotamer of interest $r$ for residue type $a$ at position $i$, where $r \in R(i^a)$.
Example: $1_3^2$ denotes the pattern formed by the third rotamer of $R(1^2)$, for the second residue
type of $A(1)$, at the first residue position.

• $[i, j]$: the pair of specific residue positions of interest $i$ and $j$.

• $[i^a, j^b]$: the pair of residue positions $i$ and $j$, where residue types $a$ and $b$ are placed at
positions $i$ and $j$, respectively (in an undefined conformation).

• $[i_r^a, j_s^b]$: the pair of residue positions $i$ and $j$, where residue types $a$ and $b$ are placed at
positions $i$ and $j$ and adopt rotameric states $r$ and
$s$, respectively. A specific instance of $[i_r^a, j_s^b]$ forms a uniquely defined pair residue
pattern.

• $[i, j, k, \ldots]$: the multiplet of residue positions of interest $i, j, k, \ldots$

• $[i^a, j^b, k^c, \ldots]$: the multiplet of residue positions $i, j, k, \ldots$ where residue types $a$,
$b$, $c$, ... are placed at positions $i, j, k, \ldots$, respectively (in an undefined
conformation).

• $[i_r^a, j_s^b, k_t^c, \ldots]$: the multiplet of residue positions $i, j, k, \ldots$ where residue types $a$, $b$, $c$,
... are placed at positions $i, j, k, \ldots$, respectively, and residue types $a$, $b$, $c$, ... adopt
rotameric states $r, s, t, \ldots$, respectively. A specific instance of $[i_r^a, j_s^b, k_t^c, \ldots]$ forms a
uniquely defined multi-residue pattern.

• $P$: the complete set of patterns of interest when generating ECO's in the SPA mode.

• $P(i)$: the set of single residue patterns of interest at position $i$, where $P(i) \subset P$. By
analogy, $P([i, j])$ and, in general, $P([i, j, k, \ldots])$ is defined for pair and multiple
residue patterns.

• $P(i^a)$: the set of single residue patterns for residue type $a$ at position $i$, where $P(i^a) \subset
P(i) \subseteq P$. By analogy, $P([i^a, j^b])$ and, in general, $P([i^a, j^b, k^c, \ldots])$ is defined for pair
and multiple residue patterns having defined residue types.

• ref: the reference structure of the protein of interest. Preferably the reference
structure is the energetically best possible global structure for the protein of interest.
In the preferred embodiment of the invention wherein rotamers are used to describe
the allowed conformational states of residue types, the reference structure can be
written as a special pattern $[i_{\mathrm{g}}^{\mathrm{r}}, j_{\mathrm{g}}^{\mathrm{r}}, k_{\mathrm{g}}^{\mathrm{r}}, \ldots]$. In this pattern, all N residues $i, j, k, \ldots$ of
the protein (excluding those of the template) assume the reference amino acid type
as defined in the reference sequence (referred to as the "reference type" r), and a
particular rotameric state as obtained from a structure prediction method such as a
DEE method (referred to as the "global minimum energy rotamer" g). Note that r
and g are not variables and are therefore not written in italic. Less preferably, the
reference structure may be any energetically relaxed structure, e.g. energy minimized by a
steepest descent or conjugate gradient method. Least preferably, the
reference structure may be any non-optimized experimental or theoretical structure
for the protein of interest or, alternatively, the reference structure may be void and
may be assigned a reference energy $E_{\mathrm{ref}} = 0$.

• $E_{\mathrm{ref}}$: the global energy of the reference structure. In the preferred embodiment using
rotameric descriptions, a fixed template and a pair-wise summable energy function,
$E_{\mathrm{ref}}$ can be conveniently written as

$$E_{\mathrm{ref}} = E_{\mathrm{template}} + \sum_i E_{\mathrm{t}}(i_{\mathrm{g}}^{\mathrm{r}}) + \sum_i \sum_{j>i} E_{\mathrm{p}}(i_{\mathrm{g}}^{\mathrm{r}}, j_{\mathrm{g}}^{\mathrm{r}}) \qquad \text{(eq. 4)}$$

Here, $E_{\mathrm{template}}$ is the template self-energy, i.e. the sum of all atomic interactions
within the template [...]
subscripts g denote the [...]
$E_{\mathrm{t}}(i_{\mathrm{g}}^{\mathrm{r}})$ is the direct interaction energy of $i_{\mathrm{g}}^{\mathrm{r}}$ [...] solvent
[...] section ENERGY [...].
Optionally, the term $E_{\mathrm{template}}$ can be omitted [...]
and (ii) $E_{\mathrm{ref}}$ is to be used in combination with $E(i_r^a)$, discussed below, wherein the
same term can be discarded (in such case $E_{\mathrm{ref}}$ is not a global but a partial energy). A
great advantage of equation (4) is that the $E_{\mathrm{t}}$ and [...]
repeatedly [...]
computation of ECO energy values. In less preferred embodiments wherein no
[...] are used and/or the template is not fixed and/or a non-pair-wise summable
energy model is applied, the reference energy [...] all
possible inter-atomic [...]. [...]
preferred embodiment, where no reference structure is considered, $E_{\mathrm{ref}}$ may be set
equal to 0 or any arbitrary value, with the consequence that ECO energies [...]
longer be interpreted as "substitution energies". However, such embodiments may
[...] be useful in specific applications wherein the absolute values of ECO energies
[...].

• $E(i_r^a)$: the global energy for the reference structure [...]
residue [...] (see DESCRIPTION OF THE ECO
CONCEPT) and (2) a SA step as [...].
[...]
equation (5), wherein the terms applicable to residue $i$ of the single residue
pattern $i_r^a$ are replaced by the appropriate energy contributions:

$$E(i_r^a) = E_{\mathrm{template}} + E_{\mathrm{t}}(i_r^a) + \sum_j E_{\mathrm{p}}(i_r^a, j_s^{\mathrm{r}}) + \sum_j E_{\mathrm{t}}(j_s^{\mathrm{r}}) + \sum_j \sum_{k>j} E_{\mathrm{p}}(j_s^{\mathrm{r}}, k_t^{\mathrm{r}}); \quad j, k \neq i \qquad \text{(eq. 5)}$$

In this equation, the residue positions $j, k \neq i$ may have quit their reference rotameric
state g as a result of the SA step, but not their reference type r: adaptation occurs by
conformational rearrangements wherein type switches are not allowed. In the
optional case that $E_{\mathrm{template}}$ has been omitted in the calculation of $E_{\mathrm{ref}}$, the same term
should be omitted in equation (5) as well and $E(i_r^a)$ is not a global but a partial
energy. In a less preferred embodiment wherein no rotamers are used and/or the
template is not fixed and/or a non-pair-wise summable energy function is applied,
the value for $E(i_r^a)$ must be calculated similarly to the reference energy $E_{\mathrm{ref}}$, typically
as a sum over all possible inter-atomic interactions within the substituted and
adapted structure.

• $E_{\mathrm{ECO}}(i_r^a)$: the ECO energy value for the single residue pattern $i_r^a$. As previously
defined, it is the energy of the reference structure which has been substituted by and
conformationally adapted to the pattern $i_r^a$, $E(i_r^a)$, minus the energy of the reference
structure:

$$E_{\mathrm{ECO}}(i_r^a) = E(i_r^a) - E_{\mathrm{ref}} \qquad \text{(eq. 6)}$$

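This definition is valid for all embodiments of the present invention. By analogy, it
is also possible to define ECO energies for pair patterns. In the preferred
embodiment using rotamers, a fixed template and a pair-wise energy function, the
global energy of the reference structure substituted by and adapted to a pair residue
pattern $[i_r^a, j_s^b]$ can be written as follows:

$$E([i_r^a, j_s^b]) = E_{\mathrm{template}} + E_{\mathrm{t}}(i_r^a) + E_{\mathrm{t}}(j_s^b) + E_{\mathrm{p}}(i_r^a, j_s^b) + \sum_k E_{\mathrm{p}}(i_r^a, k_t^{\mathrm{r}}) + \sum_k E_{\mathrm{p}}(j_s^b, k_t^{\mathrm{r}}) + \sum_k E_{\mathrm{t}}(k_t^{\mathrm{r}}) + \sum_k \sum_{l>k} E_{\mathrm{p}}(k_t^{\mathrm{r}}, l_u^{\mathrm{r}}); \quad k, l \neq i, j \qquad \text{(eq. 7)}$$

The ECO energy for a pair residue pattern is then

$$E_{\mathrm{ECO}}([i_r^a, j_s^b]) = E([i_r^a, j_s^b]) - E_{\mathrm{ref}} \qquad \text{(eq. 8)}$$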
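
Returning to the single residue case, a short worked consequence of equations (4) to (6) may help to see why only terms involving position $i$ matter before adaptation. The derivation below is a sketch under two explicit assumptions that are made here for illustration and are not stated in the text: that equations (4) and (5) have the usual pair-wise summable form (template term, plus residue-template terms $E_{\mathrm{t}}$, plus residue-residue terms $E_{\mathrm{p}}$), and that no residue $j \neq i$ leaves its reference rotameric state g during the SA step.

```latex
% Assumption: every j != i stays in its reference state, i.e. j_s^r = j_g^r.
\begin{aligned}
E_{\mathrm{ECO}}(i_r^a)
  &= E(i_r^a) - E_{\mathrm{ref}} \\
  &= \Bigl[ E_{\mathrm{t}}(i_r^a) + \sum_{j \neq i} E_{\mathrm{p}}(i_r^a, j_{\mathrm{g}}^{\mathrm{r}}) \Bigr]
   - \Bigl[ E_{\mathrm{t}}(i_{\mathrm{g}}^{\mathrm{r}}) + \sum_{j \neq i} E_{\mathrm{p}}(i_{\mathrm{g}}^{\mathrm{r}}, j_{\mathrm{g}}^{\mathrm{r}}) \Bigr] \\
  &= E_{\mathrm{t}}(i_r^a) - E_{\mathrm{t}}(i_{\mathrm{g}}^{\mathrm{r}})
   + \sum_{j \neq i} \Bigl[ E_{\mathrm{p}}(i_r^a, j_{\mathrm{g}}^{\mathrm{r}}) - E_{\mathrm{p}}(i_{\mathrm{g}}^{\mathrm{r}}, j_{\mathrm{g}}^{\mathrm{r}}) \Bigr]
\end{aligned}
```

All terms not involving position $i$, including $E_{\mathrm{template}}$, cancel in the subtraction, which is consistent with the remark that the template term may be omitted from both energies. When adaptation does occur, and provided the SA step never returns a structure of higher energy than the unadapted one, the actual ECO energy is at most equal to this value.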

In the less preferred embodiments wherein no rotamers are used and/or the template
is not fixed and/or a non-pair-wise summable energy function is applied, the value for
$E([i_r^a, j_s^b])$ [...] in accordance with the characteristics of the
[...] the substituted and adapted structure [...].
[...] patterns, the calculation of ECO energies occurs [...].
Here, only the equation for the ECO energy of a multi-residue pattern $[i_r^a, j_s^b, k_t^c, \ldots]$ is
given:

$$E_{\mathrm{ECO}}([i_r^a, j_s^b, k_t^c, \ldots]) = E([i_r^a, j_s^b, k_t^c, \ldots]) - E_{\mathrm{ref}} \qquad \text{(eq. 9)}$$

In order to complete the above formal description of ECO's, a more concrete
description of the term is now provided. An energetic compatibility object is a
[...] data set comprising elements or pieces of information. Here, a distinction
[...] between (i) basic elements and (ii) optional elements, depending on
[...] a correct execution of an ECO-based application or
[...] they may simply be helpful in such execution. Clearly, this distinction depends
on the specific application of interest to a user.

The basic elements of an ECO are those which together form a full description of
the origin and characteristics of a substituted and adapted protein structure, i.e.:

[...]

III. a description of the applied rotamer library, if any. In the preferred [...]
the present invention, this corresponds to the assignment of a set of [...]
to each position $i$ defined in (V). For positions $i$ being pattern residues, $a$ refers to
the pattern residue type, whereas for non-pattern positions of interest, $a$ refers to the
residue type as defined in (VI).

IV. a description of the applied energy function.

V. a description of the set $N$, i.e. the set of residue positions $i$ of interest. This set [...]
consist of positions in the protein at which a pattern p is placed (pattern
positions), as well as positions which are considered in the [...]
positions). The complement of $N$ is defined as the template which, in the preferred
embodiment of this invention, is fixed.

VI. a description of the amino acid reference sequence, i.e. the assignment of one
specific "reference amino acid type" r to each position $i$ defined in (V).

VII. the method applied to compute the reference structure for (VI), starting from
(I) or (II), and using (III) and (IV) (e.g. a DEE method).

VIII. a description of the reference structure for the residue positions and types defined
in (V) and (VI) respectively and obtained by the method defined in (VII), e.g. the
assignment of a "global optimum" rotamer g to each position defined in (V).

IX. the pattern residue position(s), i.e. $[i, j, k, \ldots]$

X. the pattern residue type(s), i.e. $[i^a, j^b, k^c, \ldots]$

XI. the pattern residue conformations, i.e. $[i_r^a, j_s^b, k_t^c, \ldots]$ (this may be optional for
grouped ECO's or in the embodiments not using rotamers).

XII. the method applied to perform the SA step for the reference structure defined in
(VIII), wherein the pattern defined in (IX), (X) and (XI) is introduced, preferably
being the same method as used in (VII).

XIII. the ECO energy value $E_{\mathrm{ECO}}([i_r^a, j_s^b, k_t^c, \ldots])$ as defined by equation (9) or,
alternatively, equations (6) and (8) for single and pair residue patterns,
respectively.

Optional elements of an ECO are in particular:

XIV. a specification of the residues which have changed their conformation (e.g. by
adopting a rotameric state different from g) during the SA step, as well as a
description of the actual changes such as, preferably, the new rotamers.

XV. for grouped ECO's, a reference to the grouping method, as well as a
comprehensive description of the results of the grouping process.

XVI. one or more functional properties of the protein of interest.

XVII. a local secondary structure description of the pattern residue positions $[i, j, k, \ldots]$
within the protein structure as in (I).

XVIII. a description of the solvent accessibility of the pattern residues within the
reference structure (i.e. $[i_{\mathrm{g}}^{\mathrm{r}}, j_{\mathrm{g}}^{\mathrm{r}}, k_{\mathrm{g}}^{\mathrm{r}}, \ldots]$) and/or within the substituted and adapted
structure (i.e. $[i_r^a, j_s^b, k_t^c, \ldots]$).

XIX. cartesian coordinates of one or more atoms of the pattern residues within the
reference structure and/or within the substituted and adapted structure.

XX. crystallographic temperature factors associated with the (atoms of) the pattern
residues within the protein structure from which the reference structure has been
deduced.

XXI. known chemical modifications associated with the pattern residues within the
protein structure from which the reference structure has been deduced.

XXII. physico-chemical properties associated with the pattern residue types.

Clearly, this list can be enlarged with any additional property of interest to the user.

SYSTEMATICAL PATTERN ANALYSIS (SPA) MODE 

ECO's may be generated by the present method at the time they are needed, for
example, to search locations in a protein where tryptophan substitutions are allowed.
Dozens of questions of this type may be answered by one or more real-time,
case-specific pattern analysis experiments. However, some structure-related questions can
only be solved by the present method if vast amounts of ECO data are available.
Moreover, the same ECO information may be used recurrently within
protein design experiments. Also, specific design and recognition may be
performed in the absence of any explicit 3D-representation of a protein. Finally,
pre-computed sets (or databases) of ECO's may be data mined to gain structure- and
sequence-related insights, again creating added value. Therefore, there is a need to
generate ECO's in a systematic way, called the "systematical pattern analysis (SPA)
mode".

The symbols, conventions and abbreviations used below conform to those listed in
the section FORMAL DESCRIPTION OF ECO'S.

In the SPA mode of the present invention, wherein only the rotameric embodiment
is considered, ECO's are systematically generated for a pre-defined set P of patterns being
substituted into the reference structure of a protein of interest. A distinction needs
to be made between single and double residue ECO's. Multi-residue ECO's are not
considered in the SPA mode only because their systematical computation may be too
slow.

For single ECO's, the set P is constructed as follows. First, a set I of residue
positions i of interest is defined (e.g. all modeled residues). At each position i, a set
A(i) of residue types a of interest is defined, where each position/type combination $i^a$
can be interpreted as a single residue pattern in an as yet undefined conformation. It is
noteworthy that at different positions i, different sets of types A(i) may be defined.
Next, for each defined position/type combination $i^a$, a set $R(i^a)$ comprising one or
more rotamers for type a at position i is established, preferably retrieved from a rotamer
library. A rotamer r, generated for type a, placed at position i is noted as $i^a_r$, where
$i^a_r \in R(i^a)$ and where $i^a_r$ can be interpreted as an unambiguous single residue pattern $p \in P$.

For double ECO's, basically the same approach is followed, but all combinations of
types and rotamers need to be considered for all possible pairs of positions. More
specifically, define by $i, j \in I$ a pair of residue positions, by $i^a \in A(i)$ and $j^b \in A(j)$ a
pair of residue types at positions i and j, respectively, and by $i^a_r \in R(i^a)$ and $j^b_s \in R(j^b)$ a
pair of rotamers for types a and b at positions i and j, respectively; then the set P
consists of all possible unique and unambiguous pair-residue patterns $[i^a_r, j^b_s]$, where
i < j.
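
The construction of the pattern set P described above can be illustrated with a minimal Python sketch. The function names, the dictionary layout and the toy rotamer library are assumptions made for illustration only; they are not part of the described method.

```python
from itertools import combinations, product

def single_patterns(positions, types_at, rotamers_for):
    """Enumerate all single-residue patterns (i, a, r)."""
    return [(i, a, r)
            for i in positions
            for a in types_at[i]
            for r in rotamers_for(i, a)]

def pair_patterns(positions, types_at, rotamers_for):
    """Enumerate all unique pair patterns ((i, a, r), (j, b, s)) with i < j."""
    pairs = []
    for i, j in combinations(sorted(positions), 2):  # guarantees i < j
        for a, b in product(types_at[i], types_at[j]):
            for r, s in product(rotamers_for(i, a), rotamers_for(j, b)):
                pairs.append(((i, a, r), (j, b, s)))
    return pairs

# Hypothetical toy input: two positions and a tiny rotamer library.
positions = [1, 2]
types_at = {1: ["A", "W"], 2: ["L"]}          # the sets A(i)
rotamer_library = {"A": [0], "W": [0, 1, 2], "L": [0, 1]}

def rotamers_for(i, a):
    # Assumption: the rotamer set R(i^a) depends only on the residue type.
    return rotamer_library[a]

singles = single_patterns(positions, types_at, rotamers_for)
pairs = pair_patterns(positions, types_at, rotamers_for)
```

The number of pair patterns grows with the product of the type and rotamer counts at both positions, which is consistent with the remark that multi-residue patterns are not handled systematically.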

In the SPA mode, the reference structure for the protein of interest needs to be
generated only once (see REFERENCE STRUCTURE) and the associated reference
energy $E_{ref}$ can be used in the calculation of the ECO energy value for all patterns of the
set P. Then all patterns of P are substituted into the reference structure, the structure is
adapted to the pattern and the total energy is calculated using equation (5) or (7) for
single or double-residue patterns, respectively. This yields necessary and sufficient data
to calculate the ECO energy associated with each single and/or double residue pattern,
using eqs. (6) and/or (8), respectively.

Finally, the generated data are stored on a suitable storage device (preferably into
an array, a file or a database) where typically one object per pattern is created; each
object contains (some of) the data and references described in the section FORMAL
DESCRIPTION OF ECO'S. Post-analysis operations including transformations (see
DESCRIPTION OF THE ECO CONCEPT) or grouping operations (see GROUPED
ECO'S) may then be applied.
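
The SPA loop described above (substitute, adapt, compute the energy, subtract the reference energy, store one object per pattern) might be sketched as follows. The callback `model_energy` is hypothetical and stands in for the substitution and adaptation steps, which are not reproduced here; the record layout is likewise an assumption.

```python
def run_spa(patterns, e_ref, model_energy):
    """Return one record per pattern holding its ECO energy.

    model_energy(pattern) is a hypothetical callback that substitutes the
    pattern into the reference structure, adapts the structure and returns
    the total energy E(p). The ECO energy is taken as E(p) - E_ref.
    """
    records = {}
    for p in patterns:
        records[p] = {"pattern": p, "E_ECO": model_energy(p) - e_ref}
    return records

# Toy usage with made-up total energies for two single-residue patterns.
toy_energies = {(5, "W", 0): -98.5, (5, "W", 1): -100.0}
records = run_spa(toy_energies.keys(), -100.0, toy_energies.get)
```

Because the reference energy is computed once and reused, the cost of the loop is dominated by the modeling callback.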

GROUPED ECO'S (GECO'S)

Besides energetic transformations (see DESCRIPTION OF THE ECO 
CONCEPT), the most important post-analysis method, applicable to ECO's generated 
in the SPA-mode of the present invention, is the grouping of ECO's into GECO's. 
While there is a nearly infinite variety of possible grouping operations, we shall
specifically address GECO's which are preferred in view of their practical usefulness
and ease of interpretation, i.e. (i) single GECO's, (ii) double GECO's and (iii) GECO's
resulting from a grouping operation performed on the basis of residue type properties.

A single GECO, denoted $G(i^a)$, is a GECO derived from an ensemble of single
residue ECO's, generated in SPA-mode, for a specific amino acid residue type a
placed at position i, and a set of rotamers $R(i^a)$, i.e. the patterns of this ensemble only
differ in rotameric state. Each ECO of this ensemble can be seen as the result of an
experiment, wherein the compatibility of its associated pattern, i.e. a particular residue
type a at position i in rotameric state r, is investigated within the context of an adapted
structure. A clear measure for the said compatibility is the energy of each
ECO of this ensemble, $E_{ECO}(i^a_r)$. In a preferred embodiment, the set of $E_{ECO}(i^a_r)$ values
for a given residue position and type $i^a$ and all rotamers $r \in R(i^a)$ can simply be grouped
by searching over all rotameric states for the minimum value of $E_{ECO}(i^a_r)$, according to
equation (10):

$$E_{GECO}(i^a) = \min_r E_{ECO}(i^a_r)$$ (eq. 10)

wherein $E_{GECO}(i^a)$ is the energy value resulting from the minimum searching operation
$\min_r$ over all rotamers r.

In a second embodiment, the grouping is performed by averaging over a set of
low-energy values, wherein the latter set includes all values for $E_{ECO}(i^a_r)$ below a
suitably chosen threshold value T, according to equation (11):

$$E_{GECO}(i^a) = \mathrm{aver}_r \left( E_{ECO}(i^a_r) \mid E_{ECO}(i^a_r) < T \right)$$ (eq. 11)

wherein $\mathrm{aver}_r$ is the averaging operator working on $E_{ECO}$ values below T for all
rotamers r at $i^a$. When applying equation (11), some residue types a may not be
compatible with position i, i.e. if no $E_{ECO}(i^a_r)$ value exists below T. With this
grouping method, it is impossible to assign a specific rotamer r to the GECO,
but this is irrelevant if the grouping process is only for the purpose of [...].

A third embodiment is a statistical averaging using a probability
function such as shown in equation (12):

$$E_{GECO}(i^a) = \sum_r E_{ECO}(i^a_r) \, p(i^a_r)$$ (eq. 12)

wherein

$$p(i^a_r) = \exp(-\beta E_{ECO}(i^a_r)) / Z(i^a)$$ (eq. 13)

and wherein $\beta$ is a suitable constant (e.g. $\beta = 1/(k_B T)$ wherein $k_B$ is Boltzmann's constant
and T is the temperature in degrees Kelvin) and $Z(i^a)$ is the partition function defined by
equation (14) as:

$$Z(i^a) = \sum_r \exp(-\beta E_{ECO}(i^a_r))$$ (eq. 14)

This grouping method is highly preferred since it is statistically supported and 
requires no absolute energetic threshold value. 
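
The three grouping embodiments for single GECO's can be compared with a short Python sketch. It assumes the ECO energies for one position/type combination are available as a plain list of floats; the function names are illustrative, and the energy shift in the Boltzmann average is a numerical convenience added here, not something stated in the text.

```python
import math

def geco_min(eco):
    """Equation (10): lowest ECO energy over all rotamers."""
    return min(eco)

def geco_threshold_average(eco, threshold):
    """Equation (11): mean of the ECO energies below the threshold.

    Returns None when no rotamer qualifies, i.e. when the residue type
    is considered incompatible with the position.
    """
    low = [e for e in eco if e < threshold]
    return sum(low) / len(low) if low else None

def geco_boltzmann(eco, beta):
    """Equations (12)-(14): Boltzmann-weighted average ECO energy."""
    e_min = min(eco)  # shift avoids overflow; the normalized weights are unchanged
    weights = [math.exp(-beta * (e - e_min)) for e in eco]
    z = sum(weights)  # partition function, up to the common shift factor
    return sum(e * w for e, w in zip(eco, weights)) / z

# Toy ECO energies for three rotamers of one residue type at one position.
eco = [0.0, 1.0, 5.0]
```

With `beta = 0` the weighted average reduces to the plain mean, and for large `beta` it approaches the minimum of equation (10), which illustrates why no absolute threshold is needed for this embodiment.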

A double GECO, denoted $G([i^a, j^b])$, is a grouped ECO derived from an
ensemble of double residue ECO's, generated in SPA-mode, for pair residue patterns
implying $[i^a, j^b]$. $G([i^a, j^b])$ provides information regarding the compatibility of
the pair of residue types a and b, at positions i and j, respectively, without necessarily
specifying their rotameric states. The possibilities to group double ECO's for pair
residue patterns $[i^a_r, j^b_s]$ are analogous to those for grouping single ECO's. In a first
embodiment, the minimum energy value for all pair-wise combinations of rotamers r
and s can be searched according to equation (15):

$$E_{GECO}([i^a, j^b]) = \min_{r,s} E_{ECO}([i^a_r, j^b_s])$$ (eq. 15)

wherein $\min_{r,s}$ is the minimum searching operator.

A second embodiment involves averaging combinations having a pair energy below a
given threshold, following equation (16):

$$E_{GECO}([i^a, j^b]) = \mathrm{aver}_{r,s} \left( E_{ECO}([i^a_r, j^b_s]) \mid E_{ECO}([i^a_r, j^b_s]) < T \right)$$ (eq. 16)

wherein $\mathrm{aver}_{r,s}$ is the averaging operator.

A third embodiment is a statistical averaging according to equation (17):

$$E_{GECO}([i^a, j^b]) = \sum_{r,s} E_{ECO}([i^a_r, j^b_s]) \, p([i^a_r, j^b_s])$$ (eq. 17)

wherein

$$p([i^a_r, j^b_s]) = \exp(-\beta E_{ECO}([i^a_r, j^b_s])) / Z([i^a, j^b])$$ (eq. 18)

and the partition function $Z([i^a, j^b])$ is

$$Z([i^a, j^b]) = \sum_{r,s} \exp(-\beta E_{ECO}([i^a_r, j^b_s]))$$ (eq. 19)

Another type of grouping is at the level of amino acid features or
properties rather than rotamers, which is useful in assessing the compatibility of a
certain type of amino acid, e.g. an aromatic amino acid, at residue position i. Indeed,
GECO's can be constructed by grouping amino acid types sharing a feature or property
of interest. For example, $G(i^{Pol})$ where Pol = {S,T,N,D,E,Q,R,K} denotes a GECO for a
polar amino acid at position i. The energy value associated with such a GECO can be
suitably defined by equation (20) as:

$$E_{GECO}(i^{Kind}) = \mathrm{aver}_{a \in Kind} \, E_{GECO}(i^a)$$ (eq. 20)

Such a grouping is preferably performed on GECO's already grouped over
rotameric states. Other potentially useful grouping operations include, but are not
limited to, sets of:

- aromatic amino acid types where Kind = {H,F,Y,W},
- small residue types where Kind = {G,A} or {G,A,S},
- β-branched types where Kind = {I,V,T}.
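
Grouping by residue type property, as in equation (20), amounts to averaging single GECO energies over the types in a set. The sketch below is a minimal illustration; the dictionary names and the toy energies are assumptions, and only the three type sets listed above are used.

```python
# Residue type sets taken from the list above (one-letter codes).
KINDS = {
    "aromatic": {"H", "F", "Y", "W"},
    "small": {"G", "A"},
    "beta_branched": {"I", "V", "T"},
}

def geco_kind(geco_at_position, kind):
    """Equation (20): average single GECO energy over the types in `kind`.

    geco_at_position maps a residue type to its single GECO energy at one
    position; every type in `kind` is assumed to be present.
    """
    values = [geco_at_position[a] for a in sorted(kind)]
    return sum(values) / len(values)

# Hypothetical single GECO energies at one position.
geco_i = {"H": 4.0, "F": 1.0, "Y": 2.0, "W": 5.0, "G": 0.0, "A": 1.0}
aromatic_energy = geco_kind(geco_i, KINDS["aromatic"])
small_energy = geco_kind(geco_i, KINDS["small"])
```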

Generally speaking, this type of grouping is intended for and suitable to
condense primary information units into a more compact form, which may
be useful, for example, in fold recognition by aligning amino acid sequences
against structure-based profiles.

ASSESSING GLOBAL ENERGIES FROM SINGLE AND DOUBLE (G)ECO'S

The present invention also relates to a method to assess the global energy of a
protein structure in which an arbitrary number of substitutions have taken place.
First, the relationship between single and double ECO energies is established. Then this relationship is
further analyzed, thereby demonstrating how double ECO energies can be derived from
single ECO energies. Next, this method is extended so as to compute the energy of
multi-residue patterns from single ECO energies. Finally, it is discussed how single and
double GECO's can be used to assess the global energy of a multiply substituted protein
structure.

In the preferred embodiment of the present invention, i.e. when using rotamers, 
a fixed template and a pair-wise energy function, the global energy of a reference amino 
acid sequence in its reference structure is given by equation (4). When isolating, in this 
equation, the energetic contributions for two residue positions of interest i and j, we
obtain equation (21): 

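$$\begin{aligned} E_{ref} = {} & E_{template} + \sum_k E_t(k^r_g) + \sum_k \sum_{l>k} E_p(k^r_g, l^r_g) \\ & + E_t(i^r_g) + E_t(j^r_g) + E_p(i^r_g, j^r_g) \\ & + \sum_k E_p(i^r_g, k^r_g) + \sum_k E_p(j^r_g, k^r_g) \; ; \; i \neq j \; ; \; k, l \neq i, j \end{aligned}$$ (eq. 21)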

wherein all terms and symbols have the same meaning as in equation (4). 

A similar equation for $E(i^a_r)$ can be derived from equation (5), wherein the
contributions for the residue positions i and j are isolated, recalling that $E(i^a_r)$ is the
energy for the reference structure in which the reference rotamer $i^r_g$ has been replaced by
the pattern rotamer $i^a_r$ and where the other residues have had the opportunity to assume a
new lowest energy rotamer g' which may or may not be different from the reference
rotamer g. Thus, we can write $E(i^a_r)$ as

$$\begin{aligned} E(i^a_r) = {} & E_{template} + \sum_k E_t(k^r_{g'}) + \sum_k \sum_{l>k} E_p(k^r_{g'}, l^r_{g'}) \\ & + E_t(i^a_r) + E_t(j^r_{g'}) + E_p(i^a_r, j^r_{g'}) \\ & + \sum_k E_p(i^a_r, k^r_{g'}) + \sum_k E_p(j^r_{g'}, k^r_{g'}) \; ; \; i \neq j \; ; \; k, l \neq i, j \end{aligned}$$ (eq. 22)

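Likewise, we obtain for $E(j^b_s)$, i.e. the global energy of the reference structure
substituted by and adapted to the pattern $j^b_s$, wherein possible rotameric adaptations are
denoted by the subscript g'':

$$\begin{aligned} E(j^b_s) = {} & E_{template} + \sum_k E_t(k^r_{g''}) + \sum_k \sum_{l>k} E_p(k^r_{g''}, l^r_{g''}) \\ & + E_t(i^r_{g''}) + E_t(j^b_s) + E_p(i^r_{g''}, j^b_s) \\ & + \sum_k E_p(i^r_{g''}, k^r_{g''}) + \sum_k E_p(j^b_s, k^r_{g''}) \; ; \; i \neq j \; ; \; k, l \neq i, j \end{aligned}$$ (eq. 23)

The energy of the pair pattern $[i^a_r, j^b_s]$, wherein possible rotameric adaptations are
denoted by the subscript g''', is given by equation (24):

$$\begin{aligned} E([i^a_r, j^b_s]) = {} & E_{template} + \sum_k E_t(k^r_{g'''}) + \sum_k \sum_{l>k} E_p(k^r_{g'''}, l^r_{g'''}) \\ & + E_t(i^a_r) + E_t(j^b_s) + E_p(i^a_r, j^b_s) \\ & + \sum_k E_p(i^a_r, k^r_{g'''}) + \sum_k E_p(j^b_s, k^r_{g'''}) \; ; \; i \neq j \; ; \; k, l \neq i, j \end{aligned}$$ (eq. 24)

If the expression (21) for $E_{ref}$ is subtracted from equations (22) to (24), we obtain
the equations for the single ECO energies $E_{ECO}(i^a_r)$ and $E_{ECO}(j^b_s)$ and the double ECO
energy $E_{ECO}([i^a_r, j^b_s])$ as follows:

$$\begin{aligned} E_{ECO}(i^a_r) = {} & \sum_k (E_t(k^r_{g'}) - E_t(k^r_g)) + \sum_k \sum_{l>k} (E_p(k^r_{g'}, l^r_{g'}) - E_p(k^r_g, l^r_g)) \\ & + (E_t(i^a_r) - E_t(i^r_g)) + (E_t(j^r_{g'}) - E_t(j^r_g)) + (E_p(i^a_r, j^r_{g'}) - E_p(i^r_g, j^r_g)) \\ & + \sum_k (E_p(i^a_r, k^r_{g'}) - E_p(i^r_g, k^r_g)) + \sum_k (E_p(j^r_{g'}, k^r_{g'}) - E_p(j^r_g, k^r_g)) \end{aligned}$$ (eq. 25)

$$\begin{aligned} E_{ECO}(j^b_s) = {} & \sum_k (E_t(k^r_{g''}) - E_t(k^r_g)) + \sum_k \sum_{l>k} (E_p(k^r_{g''}, l^r_{g''}) - E_p(k^r_g, l^r_g)) \\ & + (E_t(i^r_{g''}) - E_t(i^r_g)) + (E_t(j^b_s) - E_t(j^r_g)) + (E_p(i^r_{g''}, j^b_s) - E_p(i^r_g, j^r_g)) \\ & + \sum_k (E_p(i^r_{g''}, k^r_{g''}) - E_p(i^r_g, k^r_g)) + \sum_k (E_p(j^b_s, k^r_{g''}) - E_p(j^r_g, k^r_g)) \end{aligned}$$ (eq. 26)

$$\begin{aligned} E_{ECO}([i^a_r, j^b_s]) = {} & \sum_k (E_t(k^r_{g'''}) - E_t(k^r_g)) + \sum_k \sum_{l>k} (E_p(k^r_{g'''}, l^r_{g'''}) - E_p(k^r_g, l^r_g)) \\ & + (E_t(i^a_r) - E_t(i^r_g)) + (E_t(j^b_s) - E_t(j^r_g)) + (E_p(i^a_r, j^b_s) - E_p(i^r_g, j^r_g)) \\ & + \sum_k (E_p(i^a_r, k^r_{g'''}) - E_p(i^r_g, k^r_g)) + \sum_k (E_p(j^b_s, k^r_{g'''}) - E_p(j^r_g, k^r_g)) \end{aligned}$$ (eq. 27)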

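Each of the equations (25) to (27) can be regrouped in order to make them more
interpretable, as follows. The first line of each equation contains only terms related to
the non-pattern residues k, l; the second line, in contrast, contains only terms related to
the pattern positions i and j; the third line contains the interactions between pattern
positions and non-pattern residues. Moreover, each of the terms of interest
reflects the difference brought about by a pattern, compared to the reference structure.
For the first line, we can substitute

$$\Delta E^{k}_{self}(p) = \begin{cases} \sum_k (E_t(k^r_{g'}) - E_t(k^r_g)) + \sum_k \sum_{l>k} (E_p(k^r_{g'}, l^r_{g'}) - E_p(k^r_g, l^r_g)) & ; \; p = i^a_r \\ \sum_k (E_t(k^r_{g''}) - E_t(k^r_g)) + \sum_k \sum_{l>k} (E_p(k^r_{g''}, l^r_{g''}) - E_p(k^r_g, l^r_g)) & ; \; p = j^b_s \\ \sum_k (E_t(k^r_{g'''}) - E_t(k^r_g)) + \sum_k \sum_{l>k} (E_p(k^r_{g'''}, l^r_{g'''}) - E_p(k^r_g, l^r_g)) & ; \; p = [i^a_r, j^b_s] \end{cases}$$ (eq. 28)

where $\Delta E^{k}_{self}(p)$ is preferably interpreted as the difference in the global internal (or
"self") energy of the non-pattern residues $k, l \neq i, j$ (compared to the reference structure)
due to conformational rearrangements induced by pattern p, where p is either $i^a_r$, $j^b_s$ or
$[i^a_r, j^b_s]$. Similarly, we can substitute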



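$$\Delta E^{ij,k}_{int}(p) = \begin{cases} \sum_k (E_p(i^a_r, k^r_{g'}) - E_p(i^r_g, k^r_g)) + \sum_k (E_p(j^r_{g'}, k^r_{g'}) - E_p(j^r_g, k^r_g)) & ; \; p = i^a_r \\ \sum_k (E_p(i^r_{g''}, k^r_{g''}) - E_p(i^r_g, k^r_g)) + \sum_k (E_p(j^b_s, k^r_{g''}) - E_p(j^r_g, k^r_g)) & ; \; p = j^b_s \\ \sum_k (E_p(i^a_r, k^r_{g'''}) - E_p(i^r_g, k^r_g)) + \sum_k (E_p(j^b_s, k^r_{g'''}) - E_p(j^r_g, k^r_g)) & ; \; p = [i^a_r, j^b_s] \end{cases}$$ (eq. 29)

where $\Delta E^{ij,k}_{int}(p)$ is preferably interpreted as the difference in total interaction energy
between the pattern residues i, j and the non-pattern residues $k \neq i, j$ (compared to the
reference structure) due to conformational rearrangements induced by pattern p, where p
is either $i^a_r$, $j^b_s$ or $[i^a_r, j^b_s]$. When entering the former two quantities into equations (25) to
(27), we obtain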



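$$\begin{aligned} E_{ECO}(i^a_r) = {} & \Delta E^{k}_{self}(i^a_r) \\ & + (E_t(i^a_r) - E_t(i^r_g)) + (E_t(j^r_{g'}) - E_t(j^r_g)) + (E_p(i^a_r, j^r_{g'}) - E_p(i^r_g, j^r_g)) \\ & + \Delta E^{ij,k}_{int}(i^a_r) \end{aligned}$$ (eq. 30)

$$\begin{aligned} E_{ECO}(j^b_s) = {} & \Delta E^{k}_{self}(j^b_s) \\ & + (E_t(i^r_{g''}) - E_t(i^r_g)) + (E_t(j^b_s) - E_t(j^r_g)) + (E_p(i^r_{g''}, j^b_s) - E_p(i^r_g, j^r_g)) \\ & + \Delta E^{ij,k}_{int}(j^b_s) \end{aligned}$$ (eq. 31)

$$\begin{aligned} E_{ECO}([i^a_r, j^b_s]) = {} & \Delta E^{k}_{self}([i^a_r, j^b_s]) \\ & + (E_t(i^a_r) - E_t(i^r_g)) + (E_t(j^b_s) - E_t(j^r_g)) + (E_p(i^a_r, j^b_s) - E_p(i^r_g, j^r_g)) \\ & + \Delta E^{ij,k}_{int}([i^a_r, j^b_s]) \end{aligned}$$ (eq. 32)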



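This set of equations now allows a double ECO energy to be written as a function of
single ECO energies. For this purpose, equations (30) and (31) are subtracted from
equation (32):

$$\begin{aligned} E_{ECO}([i^a_r, j^b_s]) & - E_{ECO}(i^a_r) - E_{ECO}(j^b_s) = \\ & \Delta E^{k}_{self}([i^a_r, j^b_s]) - (\Delta E^{k}_{self}(i^a_r) + \Delta E^{k}_{self}(j^b_s)) \\ & + \Delta E^{ij,k}_{int}([i^a_r, j^b_s]) - (\Delta E^{ij,k}_{int}(i^a_r) + \Delta E^{ij,k}_{int}(j^b_s)) \\ & - (E_t(j^r_{g'}) - E_t(j^r_g)) - (E_t(i^r_{g''}) - E_t(i^r_g)) \\ & + (E_p(i^a_r, j^b_s) - E_p(i^r_g, j^r_g)) - (E_p(i^a_r, j^r_{g'}) - E_p(i^r_g, j^r_g)) - (E_p(i^r_{g''}, j^b_s) - E_p(i^r_g, j^r_g)) \end{aligned}$$ (eq. 33)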

The terms appearing in the second and third line of equation (33) may be
analysed as follows. First, they measure the energetic consequences for the non-pattern
residues $k \neq i, j$ due to the introduction of the patterns $i^a_r$, $j^b_s$ and $[i^a_r, j^b_s]$. Secondly, it can be
safely assumed that introducing any single or pair residue pattern is unlikely to change
the conformation of an entire protein structure, whereas the introduction of e.g. a single
residue pattern $i^a_r$ is likely to cause adaptation events for only a limited number of
residues k in its immediate environment. In other words, it is expected that a protein
structure shows some [...] behaviour with respect to perturbations of its structure
(i.e. the reference structure), i.e. small patterns are likely to induce small, local changes.
Taking this expectation for true, it can also be expected that two different single residue
patterns $i^a_r$ and $j^b_s$ [...] have non-overlapping associated sets of adapted residues.
Formally defining by K(p) the set of non-pattern residues k which adopt a different
rotameric state (compared to the reference structure) after the introduction of pattern p,
we have:

$$K(i^a_r) = \{ k \mid k^r_{g'} \neq k^r_g \}$$ (eq. 34)

$$K(j^b_s) = \{ k \mid k^r_{g''} \neq k^r_g \}$$ (eq. 35)

$$K([i^a_r, j^b_s]) = \{ k \mid k^r_{g'''} \neq k^r_g \}$$ (eq. 36)

The condition for non-overlapping sets of adapted residues associated with
patterns $i^a_r$ and $j^b_s$ is then represented by equation (37):

$$K(i^a_r) \cap K(j^b_s) = \emptyset$$ (eq. 37)

A question addressed by the method of the present invention is
then to define the relationship between the sets $K(i^a_r)$, $K(j^b_s)$ and $K([i^a_r, j^b_s])$ and, more
specifically, the conformation of the elements in each set. Ignoring all other theoretical
possibilities, we introduce herein a hypothesis called "additivity of adaptation" (AA)
which is illustrated in figure 3 and assumes that conformational adaptation effects
caused by a pair pattern $[i^a_r, j^b_s]$ are identical to the sum of all adaptation effects caused
by each of its constituting single residue patterns $i^a_r$ and $j^b_s$. A first criterion for this
hypothesis to be fulfilled is that the introduction of a pair pattern $[i^a_r, j^b_s]$ affects exactly
the same residue positions $k \neq i, j$ as being affected by either the single residue pattern $i^a_r$
or $j^b_s$, which may be formally represented by equation (38):

$$K(i^a_r) \cup K(j^b_s) = K([i^a_r, j^b_s])$$ (eq. 38)
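A further requirement for AA is that all rotamers $k^r_{g'''} \neq k^r_g$ induced by the pattern
$[i^a_r, j^b_s]$ are found among the rotamers induced by either the single residue pattern $i^a_r$ or $j^b_s$.
This criterion can be expressed formally by the following equation:

$$\begin{cases} k^r_{g'''} = k^r_{g'} & ; \; \forall k \mid k \in K(i^a_r) \\ k^r_{g'''} = k^r_{g''} & ; \; \forall k \mid k \in K(j^b_s) \end{cases}$$ (eq. 39)

This criterion is unambiguous if the conditions set forth in equations (37) and
(38) are fulfilled. Then, under the AA hypothesis, the equations (30) to (32) can be
drastically simplified, e.g. by combining equations (28) and (39):

$$\Delta E^{k}_{self}([i^a_r, j^b_s]) - \Delta E^{k}_{self}(i^a_r) - \Delta E^{k}_{self}(j^b_s) = 0$$ (eq. 40)

and, similarly, by combining equations (29) and (39):

$$\Delta E^{ij,k}_{int}([i^a_r, j^b_s]) - \Delta E^{ij,k}_{int}(i^a_r) - \Delta E^{ij,k}_{int}(j^b_s) = 0$$ (eq. 41)

so that equation (33) simplifies to:

$$\begin{aligned} E_{ECO}([i^a_r, j^b_s]) = {} & E_{ECO}(i^a_r) + E_{ECO}(j^b_s) \\ & - (E_t(j^r_{g'}) - E_t(j^r_g)) - (E_t(i^r_{g''}) - E_t(i^r_g)) \\ & + (E_p(i^a_r, j^b_s) - E_p(i^r_g, j^r_g)) - (E_p(i^a_r, j^r_{g'}) - E_p(i^r_g, j^r_g)) - (E_p(i^r_{g''}, j^b_s) - E_p(i^r_g, j^r_g)) \end{aligned}$$ (eq. 42)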

When the AA criteria are met, equation (42) provides a means to compute the
energy of a double ECO from the energies of single ECO's. This equation only requires
terms directly involving pattern positions i and j, not terms related to non-pattern
positions. The said terms involving pattern positions i and j can be readily computed as
all template ($E_t$) and pair ($E_p$) energies are known, and can be considered as corrections
needed to derive double ECO energies from single ECO energies. They can be gathered
into a new function $E_{fuse}(i^a_r, j^b_s)$ comprising the said energetic corrections needed to fuse
two single ECO energies into one double ECO energy, defined in equation (43):

$$\begin{aligned} E_{fuse}(i^a_r, j^b_s) = {} & - (E_t(j^r_{g'}) - E_t(j^r_g)) - (E_t(i^r_{g''}) - E_t(i^r_g)) \\ & + (E_p(i^a_r, j^b_s) - E_p(i^r_g, j^r_g)) - (E_p(i^a_r, j^r_{g'}) - E_p(i^r_g, j^r_g)) - (E_p(i^r_{g''}, j^b_s) - E_p(i^r_g, j^r_g)) \end{aligned}$$ (eq. 43)

Then, still under the AA hypothesis, equation (42) becomes:

$$E_{ECO}([i^a_r, j^b_s]) = E_{ECO}(i^a_r) + E_{ECO}(j^b_s) + E_{fuse}(i^a_r, j^b_s)$$ (eq. 44)

When the AA criteria are not met, as illustrated in Fig. 3b, errors in equation (44)
due to ignoring non-additivity effects can be included in a non-additivity term
$E_{non-AA}(i^a_r, j^b_s)$, leading to a universally applicable expression:

$$E_{ECO}([i^a_r, j^b_s]) = E_{ECO}(i^a_r) + E_{ECO}(j^b_s) + E_{fuse}(i^a_r, j^b_s) + E_{non-AA}(i^a_r, j^b_s)$$ (eq. 45)

The first three terms at the right hand side of equation (45), as well as the double
ECO energy at the left hand side, can be exactly calculated. Therefore, equation (45)
provides a practical means to determine the non-additivity term.

GECO's, being grouped ECO's, allow for a huge reduction of data while keeping only the
most relevant information. Since GECO's reflect how well residue types (not
necessarily rotamers) can fit at specific positions, knowledge of the
relationship between single and double GECO's would possibly make it possible to
answer questions like:

- If a double substitution $[i^a, j^b]$ is energetically compatible, is each single substitution
$[i^a]$ and $[j^b]$ energetically compatible as well, and vice-versa?, and
- Can double GECO's be easily derived from single GECO's?

The ultimate aim of such analysis is to predict the energetic compatibility of multiple
substitutions from information related to single and double GECO's, thereby supporting
protein design and fold recognition applications by allowing detailed, structure-based
predictions without actually considering any 3D structure.

By analogy to equation (45), the following relationship between
single and double GECO energies is first assumed:

$$E_{GECO}([i^a, j^b]) = E_{GECO}(i^a) + E_{GECO}(j^b) + E_{fuse}(i^a, j^b) + E_{non-AA}(i^a, j^b)$$ (eq. 46)

wherein $E_{GECO}(p)$ is the energy of a GECO of a special pattern p defined on residue
positions i and/or j carrying types a and/or b, respectively, but where the rotameric
states are not necessarily defined. The non-additivity term $E_{non-AA}(i^a, j^b)$ has the same
meaning as previously, i.e. it reflects the energetic consequences of a possible
inconsistency between adaptations caused by patterns $i^a$ and $j^b$ versus $[i^a, j^b]$. Some
ambiguity may lie in this term since the rotameric state of each of these patterns
may be undefined, except for GECO's derived by equations (10) and (15). However, this
term can simply be ignored since the method of the present invention is preferably used
under the AA hypothesis. In equation (46), $E_{fuse}(i^a, j^b)$ may be considered as a correction
term needed to take into account that $i^a$ has been considered in the reference structure
where j had type r (and mutatis mutandis for $j^b$). Since both types a and b are likely to
be different from the corresponding types in the reference structure, this fuse-term can
adopt a wide variety of values, both negative and positive. In contrast to the foregoing
description, it would be in conflict with the basic GECO concept to attempt to calculate
$E_{fuse}(i^a, j^b)$ by similarity with equation (43) for the following reason. If the rotameric
states of the involved residues are unknown or irrelevant, then it becomes impossible or
irrelevant to consider the associated rotamer/template and rotamer/rotamer interaction
energies as in equation (43). A much more practical way to assign $E_{fuse}(i^a, j^b)$ values is
to derive them directly from the single and double GECO energies according to
equation (47):

$$E_{fuse}(i^a, j^b) = E_{GECO}([i^a, j^b]) - (E_{GECO}(i^a) + E_{GECO}(j^b))$$ (eq. 47)

which is true when there is additivity in the adaptation effects caused by the pair of
single GECO's and the double GECO. However, when the AA hypothesis is not
fulfilled, the energetic consequences (actually, $E_{non-AA}(i^a, j^b)$) will be incorporated into
the fuse term $E_{fuse}(i^a, j^b)$. Anyhow, equation (47) can be used to compute double GECO
energies from a database containing single GECO energies and pair-wise fuse terms
without effectively generating the double GECO.
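
As a minimal sketch of how equation (47) could be used in practice, the following Python fragment derives fuse terms once from stored single and double GECO energies and then recovers a double GECO energy from the stored values alone. The dictionary layout, the key format and the numbers are assumptions chosen for illustration.

```python
def fuse_term(e_double, e_single_i, e_single_j):
    """Equation (47): E_fuse = E_GECO([i^a, j^b]) - (E_GECO(i^a) + E_GECO(j^b))."""
    return e_double - (e_single_i + e_single_j)

def double_from_singles(e_single_i, e_single_j, e_fuse):
    """Equation (46) with the non-additivity term absorbed in, or ignored next to, E_fuse."""
    return e_single_i + e_single_j + e_fuse

# Hypothetical stored energies, keyed by (position, type).
single = {(3, "A"): 1.0, (7, "W"): 2.5}
double = {((3, "A"), (7, "W")): 3.0}

# Build the fuse table once from the modeled double GECO's.
fuse = {pair: fuse_term(e, single[pair[0]], single[pair[1]])
        for pair, e in double.items()}

# Later, a double GECO energy can be recovered without re-modeling the pair.
key = ((3, "A"), (7, "W"))
recovered = double_from_singles(single[key[0]], single[key[1]], fuse[key])
```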

Another aspect of the present invention is a method to compute the energy of
multiply substituted proteins from single and double GECO energies. Considering that:

- equation (45) means that the substitution energy of a multiplet pattern
(such as a pair) can be derived from the substitution energy of basic patterns (i.e.
single residues),
- the present invention relates to energy functions which are essentially pair-wise
additive in the sense that the global energy of a protein structure can be written in a
form equivalent to equation (4),
- the fuse term in equation (45) mainly contains energetic contributions which are
directly and exclusively related to pattern residue positions, and
- the non-additivity effects are expected to be small in size compared to the fuse term,

and taking into account the previous elements and remarks, we postulate that the total
substitution energy of a protein structure comprising a multiplet pattern of the type
$[i^a, j^b, k^c, ...]$ can be approximated by the equation

$$\begin{aligned} E_{GECO}([i^a, j^b, k^c, ...]) \approx {} & E_{GECO}(i^a) + E_{GECO}(j^b) + E_{GECO}(k^c) + ... \\ & + E_{fuse}(i^a, j^b) + E_{fuse}(i^a, k^c) + ... \\ & + E_{fuse}(j^b, k^c) + ... \\ & + ... \end{aligned}$$ (eq. 48)

If the pattern residue positions are not explicitly given a specific name, then
equation (48) can be written in a simpler form:

$$E_{GECO}(p) \approx \sum_i E_{GECO}(i^{a(i)}) + \sum_i \sum_{j>i} E_{fuse}(i^{a(i)}, j^{a(j)})$$ (eq. 49)

where p is a multiplet pattern defined on any number of residue positions of interest i
and a residue type of interest a(i) is placed at position i in an undefined rotameric state,
and where $E_{GECO}(i^{a(i)})$ is the single residue GECO energy for $i^{a(i)}$, and where
$E_{fuse}(i^{a(i)}, j^{a(j)})$ is the fuse term for two single residue patterns $i^{a(i)}$ and $j^{a(j)}$, preferably
derived from the corresponding double GECO energy and both single GECO energies using
equation (47).

Equation (49) forms an essential aspect of the present invention since it
indicates that the energetic compatibility of any amino acid sequence of interest within
a protein structure can be estimated from only single and pair-wise energy terms
(which are preferably stored in a database).

Potential errors in the computation of $E_{GECO}(p)$ by means of equation (49) compared
to when p would be effectively substituted and modeled into the reference structure,
may result from different sources. However, the quantitative global error in equation
(49) can be included in an extra error term for the multiplet pattern p, $E_{error}(p)$, so that
we obtain

$$E_{GECO}(p) = \sum_i E_{GECO}(i^{a(i)}) + \sum_i \sum_{j>i} E_{fuse}(i^{a(i)}, j^{a(j)}) + E_{error}(p)$$ (eq. 50)

The latter equation provides a practical means to quantitatively estimate the total
error for a variety of different patterns p. Indeed, if $E_{GECO}(p)$ computed with equation
(49) is renamed as $E^{app}_{GECO}(p)$ and $E_{GECO}(p)$ resulting from the true modeling of p in the
reference structure is referred to as $E^{mod}_{GECO}(p)$, then

$$E_{error}(p) = E^{app}_{GECO}(p) - E^{mod}_{GECO}(p)$$ (eq. 51)

In the section FOLD RECOGNITION APPLICATION, a correlation is made 
between the approximate and modeled GECO energies for a selected protein structure 
and a wide variety of different patterns substituted into the topologically most difficult 
part of the protein, i.e. the core. 
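
The approximation of equation (49) lends itself to a table lookup. The following Python sketch sums single GECO energies and pair-wise fuse terms for an arbitrary multiplet pattern; the data layout (dictionaries keyed by position/type tuples) and the toy values are assumptions, not taken from the experiments reported here.

```python
from itertools import combinations

def approx_geco_energy(pattern, single, fuse):
    """Equation (49): single GECO energies plus all pair-wise fuse terms.

    pattern maps position -> residue type; single maps (i, a) -> energy;
    fuse maps ((i, a), (j, b)) with i < j -> fuse term.
    """
    residues = sorted(pattern.items())            # (i, a) tuples ordered by position
    total = sum(single[res] for res in residues)
    total += sum(fuse[(ri, rj)] for ri, rj in combinations(residues, 2))
    return total

# Hypothetical triple substitution and pre-computed tables.
pattern = {3: "A", 7: "W", 12: "L"}
single = {(3, "A"): 1.0, (7, "W"): 2.5, (12, "L"): 0.5}
fuse = {
    ((3, "A"), (7, "W")): -0.5,
    ((3, "A"), (12, "L")): 0.0,
    ((7, "W"), (12, "L")): 1.0,
}
e_approx = approx_geco_energy(pattern, single, fuse)
```

For n substituted positions the lookup involves n single terms and n(n-1)/2 fuse terms, so no structure needs to be modeled at evaluation time.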

PROOF OF PRINCIPLE 

Hereinafter, the practical usefulness of the ECO concept with respect to two 
important fields of application is illustrated. In a first experiment, a fold recognition 
simulation is performed by searching a near-optimal correlation between a target amino 
acid sequence of interest and a transformed-GECO profile derived for a protein 
structure which is homologous to the target sequence. In a second experiment, a protein 
design experiment is carried out on three selected positions in the B1 domain of protein
G (PDB code 1PGA). More specifically, at three selected residue positions 100 
randomly chosen triple mutations were modeled and the global energy of each modeled 
structure was compared to the energy calculated on the basis of single and double 
GECO energies, in accordance with equation (49). From a correlation plot between the 
two sets of energy data, it follows that the global energy of a multiply substituted 
protein structure may be estimated from only single and double GECO energies with 
sufficient accuracy. Since the latter computations occur extremely fast compared to the
effective modeling of multiple substitutions, the method based on single and double
GECO's may be useful to rapidly generate a set of amino acid sequences which are
likely to be compatible with a protein structure.

FOLD RECOGNITION APPLICATION 

In order to exemplify the potential of the ECO concept with respect to fold
recognition applications, single residue ECO's were generated for hen lysozyme (PDB
code 3LZT) by applying the SPA embodiment of the present invention (see TABLE 1
for settings and data). Next, the single residue ECO's were grouped into
single residue GECO's for each residue position/type combination by taking the
minimal ECO energy value over all rotamers, in accordance with equation (10) (see
Table 2). The resulting set of GECO's is hereinafter referred to as the initial GECO
profile for the protein studied. The initial profile for 3LZT was stored to disk so that
different versions of a recognition method (described below) could be tested using the
same profile.

It was observed that five different consecutive transformations of GECO
energies were helpful to obtain a near-optimal alignment between a target amino acid
sequence and a GECO profile. A first transformation, performed on the initial GECO
profile for 3LZT, involved the translation of the distribution of GECO energies for a
given position by the lowest value observed for that position over all residue
types. The transformation equation used was

$$E_{GECO,1}(i^a) = E_{GECO}(i^a) - \min_a E_{GECO}(i^a)$$ (eq. 52)
where $E_{GECO,1}(i^a)$ is the transformed GECO energy for residue type a at position i,
and $\min_a E_{GECO}(i^a)$ is the lowest possible $E_{GECO}(i^a)$ value at position i. As a result of this
transformation, all transformed GECO energies became equal to or greater than zero. A second
transformation included the truncation of all $E_{GECO,1}(i^a)$ values to a maximal value of
100, as formally expressed by equation (53):

$$E_{GECO,2}(i^a) = \min(E_{GECO,1}(i^a), 100)$$ (eq. 53)
where min is the min-operator acting on its two arguments $E_{GECO,1}(i^a)$ and the value
of 100. As a result, strongly strained substitutions (see Table 2) received a
value of 100. A third transformation was a logarithmic rescaling of $E_{GECO,2}(i^a)$ values by
applying the following equation:

$$E_{GECO,3}(i^a) = \log(1 + E_{GECO,2}(i^a) \times 0.99)$$ (eq. 54)

whereby a $E_{GECO,2}(i^a)$ interval [0,100] was mapped onto a $E_{GECO,3}(i^a)$ interval [0,2] by a
decimal logarithm function. Next, these values were converted to normalized units as
follows: first, a fourth transformation was performed on the distributions of $E_{GECO,3}(i^a)$
values for each position i. These distributions were assumed to be Gaussian (which is,
admittedly, an approximation) and the average value, $\mathrm{aver}_a(E_{GECO,3}(i^a))$, and standard
deviation, $\mathrm{std}_a(E_{GECO,3}(i^a))$, were calculated for each position i over all residue types a.
Then, equation (55)

$$E_{GECO,4}(i^a) = (E_{GECO,3}(i^a) - \mathrm{aver}_a(E_{GECO,3}(i^a))) / \mathrm{std}_a(E_{GECO,3}(i^a))$$ (eq. 55)

was performed on all $E_{GECO,3}(i^a)$ values, which essentially rescales the transformed
GECO energies to a number of standard deviations above or below the average value
observed for a position i. Negative $E_{GECO,4}(i^a)$ values may be associated with
"favorable" or "better than average" substitutions whereas positive values may be seen
as "unfavorable" substitutions. Next, a fifth transformation was performed on the
$E_{GECO,4}(i^a)$ values for the 20 residue type distributions by the equation

$$E_{GECO,5}(i^a) = (E_{GECO,4}(i^a) - \mathrm{aver}_i(E_{GECO,4}(i^a))) / \mathrm{std}_i(E_{GECO,4}(i^a))$$ (eq. 56)

which rescales the $E_{GECO,4}(i^a)$ values to a number of standard deviations above or below
the average value observed for a residue type a. The reason for the fourth
transformation was to include a correction for the intrinsic level of difficulty to place
different residue types at a given position, while the fifth transformation intended to
correct for the intrinsic level of complexity to place a given residue type at different
positions. A set of $E_{GECO,5}(i^a)$ values for all residue positions i of a given protein
structure and the 20 naturally occurring residue types a at each position is hereinafter
called a transformed GECO profile for the protein involved.
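
The five consecutive transformations can be sketched in a few lines of Python. The nested-dictionary layout of the profile is an assumption, as is the use of the population standard deviation (the text does not say whether the population or sample form was used); every position is assumed to carry the same residue types and no row or column is assumed to be constant.

```python
import math
from statistics import mean, pstdev

def transform_profile(profile):
    """Apply transformations 1-5 to profile[i][a] (position -> type -> E_GECO)."""
    t3 = {}
    for i, row in profile.items():
        lowest = min(row.values())
        t3[i] = {}
        for a, e in row.items():
            e1 = e - lowest                            # eq. 52: shift by position minimum
            e2 = min(e1, 100.0)                        # eq. 53: truncate at 100
            t3[i][a] = math.log10(1.0 + 0.99 * e2)     # eq. 54: map [0,100] onto [0,2]
    t4 = {}
    for i, row in t3.items():                          # eq. 55: normalize per position
        mu, sigma = mean(row.values()), pstdev(row.values())
        t4[i] = {a: (e - mu) / sigma for a, e in row.items()}
    t5 = {i: {} for i in t4}
    for a in next(iter(t4.values())):                  # eq. 56: normalize per residue type
        column = [t4[i][a] for i in t4]
        mu, sigma = mean(column), pstdev(column)
        for i in t4:
            t5[i][a] = (t4[i][a] - mu) / sigma
    return t5

# Hypothetical initial GECO profile: three positions, three residue types.
profile = {
    1: {"A": 0.0, "L": 5.0, "W": 200.0},
    2: {"A": 3.0, "L": 0.0, "W": 50.0},
    3: {"A": 10.0, "L": 20.0, "W": 0.0},
}
t5 = transform_profile(profile)
```

After the fifth step each residue type has zero mean and unit standard deviation over the positions, which is the property the subsequent scoring relies on.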

Target amino acid sequences were compared with transformed GECO profiles
using a modified Smith-Waterman (SW) alignment procedure in accordance with R.
Durbin et al. in Biological Sequence Analysis. Probabilistic models of proteins and
nucleic acids (1998) at Cambridge University Press. The modification consisted in that
no gap formation was allowed, which is equivalent to assigning an infinite gap penalty
value. Since the SW approach requires that (i) a favorable alignment of a residue against
a given position in a profile receives a positive score value (whereas favorable
$E_{GECO,5}(i^a)$ values are negative) and (ii) the global expected value of score
contributions should be negative (whereas the global expected value for $E_{GECO,5}(i^a)$
values is zero as a result of the normalization operations), two additional
transformations of the $E_{GECO,5}(i^a)$ values were needed. First, the sign of the $E_{GECO,5}(i^a)$
values was inverted to meet the first requirement:

$$E_{GECO,6}(i^a) = -E_{GECO,5}(i^a)$$ (eq. 57)

In addition, during construction of a SW alignment matrix, a value of 0.5 was
subtracted from the $E_{GECO,6}(i^a)$ values to meet the second requirement. This means that a
residue only contributes in a favorable way to an alignment if its $E_{GECO,6}(i^a)$ value is
greater than 0.5 (i.e. if the log-transformed $E_{GECO}$ value is more than 0.5 standard
deviations below the average value at position i). However, since this transformation
was only intended to identify high-scoring sequence fragments on the basis of maxima
in the alignment matrix, in accordance with the SW method, the non-transformed
$E_{GECO,6}(i^a)$ values were used in all further computations. From the SW alignment matrix,
the 50 highest-scoring fragments were selected, being characterized by (i) an offset in
the target sequence, (ii) an offset in the transformed GECO profile, (iii) a length l, and
(iv) a cumulative alignment score z being the sum of $E_{GECO,6}(i^a)$ contributions for
positions i in the profile and residue types a observed in the target sequence.
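
Without gaps, the Smith-Waterman recursion only extends along diagonals. The following Python sketch fills such a matrix using scores shifted by 0.5 and then traces back the single best fragment, accumulating the unshifted scores into z. It is a simplified illustration: the described experiment selected the 50 highest-scoring fragments rather than one, and the list-of-dictionaries profile layout and the toy scores are assumptions.

```python
def gapless_sw(profile6, target, shift=0.5):
    """Fill a Smith-Waterman matrix in which no gaps are allowed.

    profile6[i][a] is the sign-inverted score (eq. 57) of residue type a at
    profile position i; target is the target sequence as a string.
    """
    n, m = len(profile6), len(target)
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = profile6[i - 1][target[j - 1]] - shift
            h[i][j] = max(0.0, h[i - 1][j - 1] + s)    # diagonal extension only
    return h

def best_fragment(profile6, target, shift=0.5):
    """Return (profile_offset, target_offset, length, z) for the top fragment.

    z is the sum of the unshifted scores over the aligned fragment.
    """
    h = gapless_sw(profile6, target, shift)
    cells = ((i, j) for i in range(1, len(h)) for j in range(1, len(h[0])))
    i, j = max(cells, key=lambda ij: h[ij[0]][ij[1]])
    length, z = 0, 0.0
    while i > 0 and j > 0 and h[i][j] > 0.0:           # walk back along the diagonal
        z += profile6[i - 1][target[j - 1]]
        length += 1
        i -= 1
        j -= 1
    return i, j, length, z

# Hypothetical four-position profile over a two-letter alphabet.
profile6 = [
    {"A": -1.0, "L": -1.0},
    {"A": 2.0, "L": -1.0},
    {"A": -1.0, "L": 2.0},
    {"A": -1.0, "L": -1.0},
]
fragment = best_fragment(profile6, "LAL")
```

Here the returned offsets are zero-based, so the fragment starts at the second profile position and the second target residue.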

The obtained fragments were further statistically analyzed by calculating the
probability that they were found by pure chance. The probability that an
aligned fragment of a given length l and consisting of uncorrelated, randomly chosen
residue types has an alignment score equal to or better than a given score z can be
derived as follows. If two variables x and y are independent and normally distributed with an
average value of 0 and a standard deviation of 1, here written as N(0,1), then their sum
s is distributed as N(0,√2), i.e. a distribution centered around 0, with a standard
deviation equal to √2, according to Neuts (Allyn and Bacon, Inc., Boston).
By extension, the sum of n such variables is distributed as N(0,√n).
Then, the probability P(z) that the sum of n such variables is equal to or greater than a
value z is given by the integration of the probability density function N(0,√n):

\[ P(z) = \int_{z}^{\infty} \frac{1}{\sqrt{2\pi n}} \, e^{-x^{2}/(2n)} \, dx \]    (eq. 58)

When applied to aligned fragments of length l having a cumulative score z, and
for which the constituent residues have a normal distribution N(0,1), the probability that
the said value z or any greater value is obtained by pure chance is given by equation
(59):

\[ P(z) = \int_{z}^{\infty} \frac{1}{\sqrt{2\pi l}} \, e^{-x^{2}/(2l)} \, dx \]    (eq. 59)
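
As a minimal sketch, the upper-tail integral of equation (59) can be evaluated in closed form with the complementary error function, since for a normal distribution with standard deviation √l the tail probability at z equals ½·erfc(z/√(2l)). The function name `fragment_p_value` is an assumption introduced here for illustration.

```python
import math


def fragment_p_value(z: float, length: int) -> float:
    """Probability that a sum of `length` N(0,1) variables is >= z."""
    if length <= 0:
        raise ValueError("length must be positive")
    # Upper tail of N(0, sqrt(length)) expressed through erfc.
    return 0.5 * math.erfc(z / math.sqrt(2.0 * length))
```

For example, `fragment_p_value(0.0, 10)` evaluates to 0.5, as expected for a distribution centered around 0.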

The latter equation made it possible to assign a P-value to all high-scoring fragments. This
P-value corresponds to the probability that one random positioning of a fragment of
length l results in a score of at least z, which is different from the probability to obtain at
least one fragment of length l with a score of at least z (in the latter case, the size of the
target sequence and the profile need to be considered). The assignment of a P-value to
each of the 50 highest-scoring fragments made it possible to rank all fragments according to
their significance. Finally, all fragments having a P-value below 10⁻⁴ were retained as
potentially significant.

The final global alignment was obtained by first clearing, in the aforementioned
SW alignment matrix, all values for matrix elements not forming part of any retained
fragment and, secondly, performing a Needleman-Wunsch (NW) protocol to find the
best possible concatenation of the retained fragments (R. Durbin et al., cited supra). The
NW protocol was performed using a gap opening penalty of 2.0 and a gap extension
penalty of 0.2, resulting in a unique global alignment. The significance of a global
alignment was assessed by summing the cumulative scores of
all fragments forming part of the global alignment, diminished by 2 x (the number of
fragments - 1), further diminished by 0.2 x the total number of non-matched residue
positions between matched fragments, and inputting this value into equation (59) in order to
obtain a global P-value. The result was qualified as a "positively recognized fold" if the
global P-value was below 10⁻¹⁵. Alternatively, a fold was also considered as "positively
recognized" if the best fragment had a P-value below 10⁻¹⁰. This alignment procedure
may not be optimal, but it forms a feasible and practically useful method to illustrate
the potential of the ECO concept with respect to fold recognition.
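
The scoring and decision rule of the preceding paragraph may be sketched as follows. This is an illustrative sketch only: the names `global_score` and `is_positively_recognized` are assumptions, and the conversion of the global score into a global P-value via equation (59) is left to the caller.

```python
from typing import Sequence

GAP_OPEN = 2.0      # gap opening penalty stated in the text
GAP_EXTEND = 0.2    # gap extension penalty stated in the text


def global_score(fragment_scores: Sequence[float], unmatched: int) -> float:
    """Sum of fragment scores minus the gap penalties between fragments."""
    if not fragment_scores:
        raise ValueError("at least one fragment is required")
    n_gaps = len(fragment_scores) - 1
    return sum(fragment_scores) - GAP_OPEN * n_gaps - GAP_EXTEND * unmatched


def is_positively_recognized(global_p: float, best_fragment_p: float) -> bool:
    """Apply the two alternative thresholds for a positively recognized fold."""
    return global_p < 1e-15 or best_fragment_p < 1e-10
```

For three fragments with scores 10, 8 and 6 separated by 5 non-matched positions in total, `global_score([10.0, 8.0, 6.0], 5)` gives 24 - 4 - 1 = 19.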

The aforementioned procedure for ECO-based analysis of the structural
compatibility of a target amino acid sequence with a protein 3D structure was applied to
a homologous pair of proteins from the C-type lysozyme family. Concretely, the
sequence of human alpha-lactalbumin was aligned with the transformed GECO
profile of hen lysozyme (PDB code 3LZT). The sequence identity between both
proteins is 40% for the best matching fragment of [...] residues (out of [...] for
lactalbumin and 129 for lysozyme), according to a BLAST alignment performed at the
address http://www.ncbi.nlm.nih.gov/. As a result of the alignment procedure,
[...] fragments were retained having a P-value below 10⁻⁴ (FIG. 4). The best
fragment had a P-value of 5.9 x 10^[...], which already qualified the lactalbumin target
sequence as being structurally compatible with the 3LZT fold. Three other fragments
(P-values in ascending order were 1.5 x 10^[...], 6.3 x 10^[...] and [...]) formed part
of the final global alignment. Since the 3D structure of human alpha-lactalbumin is known
(PDB code 1B90), a structural comparison with 3LZT was possible.
On the basis of this comparison, the alignment can be clearly [...].
The concatenation procedure succeeded in selecting only those fragments
that are [...] similar in structure. Non-matched regions (between the fragments)
were observed typically in loop regions where one or more deletions or [...].
[...] had a dissimilar structure and were correctly excluded by the NW concatenation.

In order to investigate the discriminative power of the method with respect to [...],
the lactalbumin sequence has been randomly permuted 1000 times [...]
the best scoring fragment with the lowest P-value [...]. In these
experiments the best random fragment had a P-value of 1.4 x 10^[...] [...]
false positive fragments generally appearing in each alignment experiment [...]
correlation effects, i.e. the [...] random fragment. Since no structural similarity
between the target and profile fragments could be observed, the nature of the presumed
[...] is as yet not understood. Moreover, portions of this exceptional fragment also
appeared in other false positive fragments (i.e. fragments 5, 6 and 9 in FIG. 4), although
their P-value was typical for random fragments, i.e. around 10^[...].
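
A permutation experiment of this kind may be sketched as follows, purely as an illustration. The callback `best_p_value`, which is assumed to run the alignment procedure on one sequence and return the lowest fragment P-value, is hypothetical, as are the function name `permutation_null` and the `seed` parameter.

```python
import random
from typing import Callable, List


def permutation_null(sequence: str,
                     best_p_value: Callable[[str], float],
                     n_permutations: int = 1000,
                     seed: int = 0) -> List[float]:
    """Best P-value obtained for each random permutation of `sequence`."""
    rng = random.Random(seed)        # fixed seed for reproducibility
    residues = list(sequence)
    results = []
    for _ in range(n_permutations):
        rng.shuffle(residues)        # composition is preserved, order is not
        results.append(best_p_value("".join(residues)))
    return results
```

Comparing the P-value of the native sequence with the smallest value in the returned list gives an empirical impression of how far the true alignment stands out from random ones.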

In conclusion, these experiments show that the ECO concept is useful to identify a
structural relationship between two protein molecules by aligning a target amino
acid sequence against a GECO profile. Moreover, a global structure-based
alignment may be obtained. These results are obtained by only using single ECO's,
more specifically the substitution energies of single residues in the context of a
reference structure which may be up to 60% different in amino acid sequence compared
to a target sequence. When suitably including information derived from double ECO's,
the quality of the alignments is expected to increase, since double ECO's contain
information related to pair-wise residue/residue interactions.

PROTEIN DESIGN APPLICATION 

The B1 domain of protein G (PDB code 1PGA) was chosen to simulate 100
different randomly selected triple-residue substitutions and to compare the global
energy obtained by the effective modeling of each triple-substitution with the global
energy estimated on the basis of single and double GECO energies. This was not
intended to predict the most stable variant but rather to show that protein design
applications based on single and double (G)ECO's form a valuable alternative
compared to previously known methods directly operating on a 3D structure, thereby
being confronted with the combinatorial substitution problem.

First, the GMEC of the 1PGA structure was computed using the FASTER method
(such as previously described) and the same rotamer library and energy function as in
all other calculations (see Table 1). All residues having a rotatable side chain were
included in the GMEC computation. The global energy of this structure, which was
taken as the reference structure in all experiments related to 1PGA, was -149.3 kcal mol⁻¹.

Three proximate residue positions located in the core of 1PGA were selected
to perform further design experiments. More specifically, residues 5 (LEU), 30 (PHE)
and 52 (PHE) were chosen to be mutated into 100 different random combinations of
residue types. Each combination is hereinafter called a mutated sequence and
represented as {X,Y,Z} where X, Y and Z refer to the residue types placed at positions
5, 30 and 52, respectively. The WT "mutation" {LEU, PHE, PHE} was added to this set
as an extra test. Each mutated sequence was modeled into 1PGA exactly in the same
way as the WT sequence, after having performed the necessary side-chain replacements.
These modeling experiments were executed by a pure structure prediction method (i.e.
the FASTER method) and not by an embodiment of the present invention: the
conformation of the mutated residues was not kept fixed (in contrast to patterns) but
was co-modeled with that of all other rotatable side chains. Analysis of the results
showed that the global energies of the mutated sequences ranged from -149.3 to 22274
kcal mol⁻¹. However, all positive values were associated with sequences containing at
least one PRO residue, which is prohibited in a β-sheet (positions 5 and 52) and in an
α-helix (position 30). The highest energy for sequences not containing PRO was -85.5
kcal mol⁻¹ and was associated with the sequence {TRP,GLN,VAL}. The sequences with
a negative energy clustered around -121.6 ± 9.7 kcal mol⁻¹ (standard deviation). No
sequence had an energy better than the WT sequence, not surprisingly since only 100
out of the 20³ possible sequences were examined and since the WT structure is well-[...].

In order to exemplify the use of the ECO concept with respect to protein design, it
was investigated how the energies from the modeled sequences could be approximated
by using only single and double GECO energies. For this purpose, all possible single
and double ECO's for the three selected positions were generated in agreement with the
rotameric SPA embodiment of the present invention. The same experimental settings
were applied as in the previous fold recognition experiments. ECO's were grouped into
GECO's in accordance with equations (5) and [...], but no transformation was
performed on the resulting GECO energies.

From the sets of calculated single residue GECO energies, E_GECO(i,a), and double
residue GECO energies, E_GECO(i,a,j,b), an equivalent set of fuse terms, E_fuse(i,a,j,b), was
deduced by applying equation (47). The values for all fuse terms related to positions 5
and 30 are listed in Table 4. Most of the fuse terms are negative, which means that
effectively modeled double mutations are generally better than the mere sum of each
single substitution energy. Roughly speaking, negative values are expected to arise from
better interactions in a double substitution compared to each single substitution
modeled into the reference structure, although a double substitution may also be
energetically more constrained than each of the single mutations involved. In addition,
some effects of non-additivity of adaptation may play a role as well. Yet, the fuse terms
automatically correct for all these effects so that, together with the single GECO
energies, they can be used "backwards" to reconstruct a double mutation without
actually needing to model it.
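
The relationship between single energies, double energies and fuse terms may be illustrated by the following sketch. It assumes that equation (47), which is not reproduced in this passage, has the form E_fuse(i,a,j,b) = E_GECO(i,a,j,b) - E_GECO(i,a) - E_GECO(j,b); this form is consistent with the statement that negative fuse terms indicate double mutations that are better than the sum of the single substitutions, but it remains an assumption. The names `fuse_terms`, `Single` and `Pair` and the dictionary layout are likewise hypothetical.

```python
from typing import Dict, Tuple

Single = Tuple[int, str]            # (position, residue type)
Pair = Tuple[int, str, int, str]    # (position i, type a, position j, type b)


def fuse_terms(single: Dict[Single, float],
               double: Dict[Pair, float]) -> Dict[Pair, float]:
    """Fuse term = double energy minus the two single energies (assumed form)."""
    return {
        (i, a, j, b): e_double - single[(i, a)] - single[(j, b)]
        for (i, a, j, b), e_double in double.items()
    }
```

Used "backwards", the double energy is recovered as the sum of the two single energies and the corresponding fuse term.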

The question of how equation (49) performs in the computation of the global
energy of multiply substituted proteins was then investigated as follows. The energy of each of
the 101 triplet mutations (100 random sequences plus the WT sequence) was calculated
using equation (49) and these energies were compared to the energies resulting from the
effective modeling in a correlation diagram (FIG. 5). It was observed that both data sets
correlate surprisingly well, given the numerous possible sources of approximation. Out
of the 101 sequences, 55 had a difference between modeled and calculated energy of less
than 1 kcal mol⁻¹ in absolute value and 25 more had a difference of less than 5 kcal mol⁻¹.
The present invention includes differences of less than 10 kcal mol⁻¹. Moreover, in the
most interesting range, i.e. for low-energy sequences, the correlation was even better. In
the range up to 20 kcal mol⁻¹ above the energy of the WT sequence, all 15 sequences
differed in energy by less than 1 kcal mol⁻¹ in absolute value. Some larger differences
were observed as well, but they were found exclusively in regions of higher energy.
Even for the most constrained sequences containing at least one PRO (15 sequences;
results not shown in FIG. 5), 14 had a difference in energy below 10 kcal mol⁻¹, while
the energy of these modeled sequences was always higher than 400 kcal mol⁻¹. While
most energy values calculated from single GECO energies and fuse terms were roughly
evenly distributed above and below the values from the modeled sequences, the largest
differences were always positive, which may be due to non-additivity in the adaptations
upon calculation of the double residue ECO's (leading to an overestimation of the fuse
terms). For comparison, the energy of each sequence calculated as the sum of each
involved single GECO energy (ignoring the fuse terms) was plotted in FIG. 5 as well.
Even here some correlation with the effectively modeled energies can be observed but
the percentage difference is significantly higher, which illustrates that the fuse terms
contribute to the global energy in a most beneficial way.
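
A possible way of estimating the energy of a multiply substituted sequence from pre-computed single energies and fuse terms is sketched below. Equation (49) is not reproduced in this passage; the sketch assumes that it sums the single GECO energies of all substituted positions and the fuse terms of all pairs of substituted positions, optionally offset by a reference energy. The function name `estimate_energy`, the `reference_energy` parameter and the dictionary layout are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, Iterable, Tuple

Single = Tuple[int, str]            # (position, residue type)
Pair = Tuple[int, str, int, str]    # (position i, type a, position j, type b)


def estimate_energy(mutations: Iterable[Single],
                    single: Dict[Single, float],
                    fuse: Dict[Pair, float],
                    reference_energy: float = 0.0) -> float:
    """Assumed pairwise estimate: singles plus fuse terms over all pairs."""
    muts = sorted(mutations)        # ascending positions, so that i < j
    total = reference_energy + sum(single[m] for m in muts)
    for (i, a), (j, b) in combinations(muts, 2):
        total += fuse[(i, a, j, b)]
    return total
```

Omitting the loop over pairs corresponds to the comparison in which the fuse terms are ignored.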
In conclusion, the estimation of the global energy of multiply substituted proteins
may be performed to sufficient accuracy by using information derived from only single
and double residue ECO's, at least in the case studied, i.e. the core of the B1 domain of
protein G. It is well-known in the art that core substitutions are among the most difficult
ones since delicate packing effects are involved. Therefore, the present invention allows
the design of new sequences, irrespective of the region where mutations are proposed.
In addition, it is worth noting that the present embodiment to derive global energies
from single and double residue GECO's is extremely fast when the latter values have
been pre-calculated and can be retrieved from a file or database. Therefore, the present
method may be useful to rapidly search a reduced set of
sequences out of a combinatorial number of possibilities. Sequences obtained this way
may then be submitted to more detailed analysis.
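
Such a rapid search may be sketched as an exhaustive enumeration over pre-calculated tables, keeping only the lowest-energy candidates. The sketch is illustrative only: the function name `lowest_energy_sequences`, the table layout and the assumed pairwise energy estimate (single energies plus fuse terms for all position pairs) are hypothetical and are not taken from the described embodiment.

```python
import heapq
from itertools import combinations, product
from typing import Dict, List, Sequence, Tuple

Single = Tuple[int, str]            # (position, residue type)
Pair = Tuple[int, str, int, str]    # (position i, type a, position j, type b)


def lowest_energy_sequences(positions: Sequence[int],
                            residue_types: Sequence[str],
                            single: Dict[Single, float],
                            fuse: Dict[Pair, float],
                            k: int = 10) -> List[Tuple[float, Tuple[str, ...]]]:
    """Return the k lowest estimated energies with their residue types.

    `positions` must be given in ascending order so that the pair keys
    (i, a, j, b) satisfy i < j.
    """
    def energy(seq: Tuple[str, ...]) -> float:
        muts = list(zip(positions, seq))
        e = sum(single[m] for m in muts)
        e += sum(fuse[(i, a, j, b)] for (i, a), (j, b) in combinations(muts, 2))
        return e

    candidates = product(residue_types, repeat=len(positions))
    return heapq.nsmallest(k, ((energy(s), s) for s in candidates))
```

Because only table look-ups and additions are involved, all 20³ combinations for three positions can be scored without modeling any structure.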
The present invention may be implemented on a computing device, e.g. a
[...] reference structure and the reference amino acid sequence defined in step A of
[...] and also any further amino acid sequence. The computing device is adapted to
[...] carry out any of the methods in accordance with the present
invention. The computer may be a server which is connected to a data
communication transmission means such as the Internet. A script file including the details of the
reference structure and the amino acid sequence may be sent from one near location,
e.g. a terminal, to a remote, i.e. second, location at which the server resides. The server
receives this data and outputs back along the communications line useful data to the
near terminal, e.g. the alignment of a sequence relative to the [...]
set of favourable amino acid sequences for one or more residues in the reference
[...].

[...] the claimed methods, a database or part of a database including sets of ECO
[...] terminal 14, a data input means such as a keyboard 16, and a graphic user interface
indicating means such as a mouse 18. Computer 10 may be implemented as a general
[...].

Computer 10 includes a Central Processing Unit ("CPU") 15, such as a
conventional microprocessor of which a Pentium III processor supplied by Intel Corp.
is only an example, and a number of other units interconnected via system bus 22.
The computer 10 includes at least one memory. Memory may include any of a variety of
data storage devices known to the skilled person such as random-access memory ("RAM"),
read-only memory ("ROM"), or non-volatile read/write memory such as a hard
disc as known to the skilled person. For example, computer 10 may further include
random-access memory ("RAM") 24, read-only memory ("ROM") 26, as well as an
optional display adapter 27 for connecting system bus 22 to an optional video display
terminal 14, and an optional input/output (I/O) adapter 29 for connecting peripheral
devices (e.g., disk and tape drives 23) to system bus 22. Video display terminal 14 can
be the visual output of computer 10, which can be any suitable display device such as a
CRT-based video display well-known in the art of computer hardware. However, with a
portable or notebook-based computer, video display terminal 14 can be replaced with an
LCD-based or a gas plasma-based flat-panel display. Computer 10 further includes user
interface adapter 30 for connecting a keyboard 16, mouse 18 and optional speaker 36, as
well as allowing optional physical value inputs from physical value capture devices 40
of an external system 20. The devices 40 may be any suitable equipment for capturing
physical parameters required in the execution of the present invention. Additional or
alternative devices 41 for capturing physical parameters of an additional or alternative
external system 21 may also be connected to bus 22 via a communication adapter 39
connecting computer 10 to a data network such as the Internet, an Intranet, a Local or
Wide Area network (LAN or WAN) or a CAN. The term "physical value capture
device" includes devices which provide values of parameters of a system, e.g. they may
be an information network or databases such as a rotamer library.

Computer 10 also includes a graphical user interface that resides within a
machine-readable medium to direct the operation of computer 10. Any suitable
machine-readable medium may retain the graphical user interface, such as a random access
memory (RAM) 24, a read-only memory (ROM) 26, a magnetic diskette, magnetic tape,
or optical disk (the last three being located in disk and tape drives 23). Any suitable
operating system and associated graphical user interface (e.g., Microsoft Windows) may
direct CPU 15. In addition, computer 10 includes a control program 51 which resides
within computer memory storage 52. Control program 51 contains instructions that,
when executed on CPU 15, carry out the operations described with respect to the
methods of the present invention as described above.

Those skilled in the art will appreciate that the hardware represented in FIG. 6
may vary for specific applications. For example, other peripheral devices such as optical
disk media, audio adapters, or chip programming devices, such as PAL or EPROM
programming devices well-known in the art of computer hardware, and the like may be
utilized in addition to or in place of the hardware already described.

In the example depicted in FIG. 6, the computer program product (i.e. control
program 51) can reside in computer storage 52. However, [...] the mechanisms of
the present invention are capable of being distributed as a program product in a variety
of forms, and the present invention applies regardless of the particular type
of signal bearing media used [...]. Examples of
computer readable signal bearing media include recordable type media such as floppy
disks and CD ROMs, and transmission type media such as digital [...]
communication links.

TABLE 1 shows a compilation of single residue ECO's generated by the present 
method. Each line represents an ECO. The columns are labeled according to the list 
described in the section FORMAL DESCRIPTION OF ECO'S. Only basic ECO 
properties are shown. 3LZT is the PDB code for hen lysozyme; SD200: 200 steps 
steepest descent energy minimization; all H: CHARMM force field of Brooks et al. 
(cited supra), including explicit hydrogen atoms and TTS contributions as listed in 
Table 3; Flex : all "flexible" residues having a rotatable side chain; WT:res: WT residue 
"res" as observed in 3LZT; 3LZT__G: global minimum energy conformation for 3LZT. 
The value "1" in column III refers to the rotamer library of De Maeyer et al. (cited 
supra). 



WO 01/37147
PCT/EP00/10923

[The data rows of TABLE 1 and the contents of the subsequent tables are not recoverable from the source.]

CLAIMS 

1. A method for analyzing a protein structure, the method being executable in a
computer under the control of a program stored in the computer, and comprising the
following steps:

A. receiving a reference structure for a protein, whereby
- the said reference structure forms a representation of a 3D structure of the said
protein, and
- the protein consists of a plurality of residue positions, each carrying a particular
reference amino acid type in a specific reference conformation, and
- the protein residues are classified into a set of modeled residue positions and a set
of fixed residues, the latter being included into a fixed template;

B. substituting into the reference structure of step (A) a pattern, whereby
- the said pattern consists of one or more of the modeled residue positions defined
in step (A), each carrying a particular amino acid residue type in a fixed
conformation, and
- the one or more amino acid residue types of the said pattern are replacing the
corresponding amino acid residue types present in the reference structure;

C. optimizing the global conformation of the reference structure of step (A) being
substituted by the pattern of step (B), whereby
- a suitable protein structure optimization method based on a function allowing to
assess the quality of a global protein structure, or any part thereof, is used in
combination with a suitable conformational search method, and
- the said structure optimization method is applied to all modeled residue positions
defined in step (A) not being located at any of the pattern residue positions
defined in step (B), and
- the pattern and template residues are kept fixed;

D. assessing the energetic compatibility of the pattern defined in step (B) within the
context of the reference structure defined in step (A) being structurally optimized
in step (C) with respect to the said pattern, by way of comparing the global
energy of the substituted and optimized protein structure with the global energy
of the non-substituted reference structure; and

E. storing a value reflecting the said energetic compatibility of the pattern
together with information related to the structure of the pattern in the form of an
energetic compatibility object.

2. A method according to claim 1, wherein the reference structure of step (A) further
represents the 3D structure of ligands including co-factors, ions or water molecules.

3. A method according to claim 1 or claim 2, wherein the reference structure of step (A)
is received from a database of protein structures.

4. A method according to any of claims 1 to 3, wherein the reference structure of step
(A) has undergone a further energy optimization step by means of a molecular
mechanics gradient method.

5. A method according to any of claims 1 to 4, wherein the reference structure of step
(A) assumes a reference amino acid sequence being the WT sequence of the said
protein.

6. A method according to any of claims 1 to 4, wherein the reference structure of step
(A) assumes a reference amino acid sequence being the sequence of lowest energy for
the said protein.

7. A method according to any of claims 1 to 4, wherein the reference structure of step
(A) assumes a reference amino acid sequence being a polyglycine or a polyalanine
sequence.

8. A method according to any of claims 1 to 4, wherein the reference structure of step
(A) assumes any possible reference amino acid sequence including a random or a
dummy sequence.

9. A method according to any of claims 1 to 8, wherein the reference structure of step
(A) has been modeled by the structure optimization method used in step (C).

10. A method according to any of claims 1 to 9, wherein the set of modeled residue
positions comprises at least one position.

11. A method according to any of claims 1 to 9, wherein the set of modeled residue
positions comprises at least three positions.

12. A method according to any of claims 1 to 9, wherein the set of modeled residue
positions comprises at least five positions.

13. A method according to any of claims 1 to 12, wherein the pattern of step (B)
consists of a single residue position, the said pattern being referred to as a single residue
pattern.

14. A method according to any of claims 1 to 12, wherein the pattern of step (B)
consists of a pair of residue positions, the said pattern being referred to as a double
residue pattern.

15. A method according to any of claims 1 to 12, wherein the pattern of step (B)
consists of a series of consecutive, interconnected residue positions, the said pattern
being referred to as a loop pattern.

16. A method according to any of claims 1 to 15, wherein each of the pattern residue
positions defined in step (B) are assigned a naturally occurring amino acid residue type.

17. A method according to any of claims 1 to 15, wherein each of the pattern residue
positions defined in step (B) are assigned a naturally or a non-naturally occurring amino
acid residue type.

18. A method according to any of claims 1 to 17, wherein each of the pattern residue
positions carrying a particular amino acid residue type as defined in step (B) are
assigned a fixed conformation retrieved from a protein structure database.

19. A method according to any of claims 1 to 17, wherein each of the pattern residue
positions carrying a particular amino acid residue type as defined in step (B) are
assigned a fixed conformation which is computer-generated.

5 20. A method according to any of claims 1 to 17, wherein each of the pattern residue 
positions carrying a particular amino acid residue type as defined in step (B) are 
assigned a fixed conformation which is received from a rotamer library, the said pattern 
being referred to as a rotameric pattern. 



10 



21. A method according to claim 20, wherein the said rotamer library is backbone- 
independent, the said library being referred to as a side-chain rotamer library. 



22. A method according to claim 20, wherein the said rotamer library is backbone-
dependent, the said library being referred to as a combined main-chain/side-chain
rotamer library, but wherein the main-chain information in the library is only used in
order to assign a side-chain conformation to each of the pattern residue positions
carrying a particular amino acid residue type as defined in step (B) which is structurally
compatible with the local main-chain conformation of each corresponding residue
position in the reference structure of step (A).

23. A method according to claim 20, wherein the said rotamer library is backbone-
dependent, the said library being referred to as a combined main-chain/side-chain
rotamer library, and wherein each of the pattern residue positions carrying a particular
amino acid residue type as defined in step (B) are assigned a fixed conformation in
accordance with a full main-chain/side-chain rotamer known in the said rotamer library
for the corresponding amino acid residue type located at each pattern residue position.

24. A method according to any of claims 1 to 23, wherein the said cost function of step
(C) is an energy function allowing to assess the potential or free energy of a global
protein structure or any part thereof.

25. A method according to any of claims 1 to 24, wherein the said conformational
search method of step (C) includes a DEE method.

26. A method according to any of claims 1 to 24, wherein the said conformational
search method of step (C) includes a method comprising the steps of:
(a) receiving a 3D representation of the molecular structure of a protein, the said
representation comprising a first set of residue portions and a template;
(b) modifying the representation of step (a) by at least one optimization cycle;
characterized in that each optimization cycle comprises the steps of:
(b1) perturbing a first representation of the molecular structure by modifying
the structure of one or more of the first set of residue portions;
(b2) relaxing the perturbed representation by optimizing the structure of one
or more of the non-perturbed residue portions of the first set with respect
to the one or more perturbed residue portions;
(b3) evaluating the perturbed and relaxed representation of the
molecular structure by using an energetic cost function and
replacing the first representation by the perturbed and relaxed
representation if the latter's global energy is more optimal than that
of the first representation; and
(c) terminating the optimization process according to step (b) when a
predetermined termination criterion is reached; and
(d) outputting to a storage medium or to a consecutive method a data structure
comprising information extracted from step (b).
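
By way of illustration only, the perturbation, relaxation and evaluation cycle of steps (a) to (d) may be sketched as follows. The sketch assumes a toy representation in which each residue portion is described by a single rotamer index and the cost function is a sum of hypothetical single and pairwise energy terms; the names `energy`, `optimize`, `self_e` and `pair_e` are illustrative and do not appear in the claims. The relaxation step here re-optimizes each non-perturbed portion against the full current state, which is one possible reading of step (b2).

```python
import random

def energy(state, self_e, pair_e):
    """Toy global energy: single-residue terms plus pairwise terms (assumed form)."""
    n = len(state)
    total = sum(self_e[i][state[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            total += pair_e[i][j][state[i]][state[j]]
    return total

def optimize(state, self_e, pair_e, max_cycles=200, seed=0):
    rng = random.Random(seed)
    n = len(state)
    best = list(state)                        # (a) received representation
    best_e = energy(best, self_e, pair_e)
    for _ in range(max_cycles):               # (c) termination: fixed cycle budget
        trial = list(best)
        p = rng.randrange(n)                  # (b1) perturb one residue portion
        trial[p] = rng.randrange(len(self_e[p]))
        for j in range(n):                    # (b2) relax the non-perturbed portions
            if j == p:
                continue
            def e_with(r, j=j):
                return energy(trial[:j] + [r] + trial[j + 1:], self_e, pair_e)
            trial[j] = min(range(len(self_e[j])), key=e_with)
        trial_e = energy(trial, self_e, pair_e)
        if trial_e < best_e:                  # (b3) replace only if energy improves
            best, best_e = trial, trial_e
    return best, best_e                       # (d) output for storage or a next method
```

The accepted representation can only improve or stay unchanged from one cycle to the next, so the returned energy is never higher than that of the starting representation.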

27. A method according to any of claims 1 to 24, wherein the said conformational
search method of step (C) includes a Monte Carlo simulation method.
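
As a non-limiting illustration, a Monte Carlo search over rotamer assignments may be sketched with the Metropolis acceptance rule. The claim does not specify the Monte Carlo variant; the Metropolis criterion, the temperature parameter `kT`, and the names `metropolis_search`, `n_rotamers` and `energy_fn` are assumptions made for this sketch only.

```python
import math
import random

def metropolis_search(state, n_rotamers, energy_fn, steps=1000, kT=0.6, seed=0):
    """Random single-residue rotamer moves accepted by the Metropolis rule."""
    rng = random.Random(seed)
    current = list(state)
    e_cur = energy_fn(current)
    best, e_best = list(current), e_cur
    for _ in range(steps):
        i = rng.randrange(len(current))           # pick a modeled residue position
        old = current[i]
        current[i] = rng.randrange(n_rotamers[i])  # propose a new rotamer
        e_new = energy_fn(current)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if e_new <= e_cur or rng.random() < math.exp(-(e_new - e_cur) / kT):
            e_cur = e_new
            if e_cur < e_best:
                best, e_best = list(current), e_cur
        else:
            current[i] = old                       # reject: restore previous rotamer
    return best, e_best
```

The function tracks the lowest-energy state visited, since the Markov chain itself may wander uphill after finding it.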

28. A method according to any of claims 1 to 24, wherein the said conformational
search method of step (C) includes a genetic algorithm method.

29. A method according to any of claims 1 to 24, wherein the said conformational
search method of step (C) includes a mean-field method.



30. A method according to any of claims 1 to 24, wherein the said conformational
search method of step (C) includes a molecular mechanics gradient method such as a
steepest descent or a conjugate gradient energy minimization method.

31. A method according to claim 30, wherein the said template residues are not kept 
fixed but are allowed to vary by at most 2 angstroms in root-mean-square deviation 
compared to the reference structure of step (A). 

32. A method according to claim 30, wherein neither the said pattern residues nor the
said template residues are kept fixed but wherein they are allowed to vary by at most 2
angstroms in root-mean-square deviation compared to the substituted reference structure
of step (B).

33. A method according to any of claims 1 to 29, wherein the residue positions to which
the structure optimization is applied as defined in step (C) are rotameric, in that they are
allowed to assume conformations received from a rotamer library in the same way as
occurs for rotameric pattern residues.

34. A method according to any of claims 1 to 33, wherein the said energetic
compatibility of the pattern is calculated as the difference in global energy between the
substituted and optimized protein structure of step (C) and the reference structure of
step (A).

35. A method according to any of claims 1 to 33, wherein the said energetic
compatibility of the pattern is calculated as the absolute global energy of the substituted
and optimized protein structure of step (C), this being equivalent to selecting in step (A)
a reference amino acid sequence such as a polyglycine sequence or a dummy sequence.

36. A method according to any of claims 1 to 35, wherein the self-energy of the
template is ignored.

37. A method according to any of claims 1 to 36, wherein the self-energy of the protein
backbone is ignored.

38. A method according to any of claims 1 to 37, wherein the ECO formed in step (E)
further includes an unambiguous description of the pattern in terms of the number of
[text missing] structure, the [illegible] of an amino acid residue type at each position and the
conformation of each pattern residue.

39. A method according to any of claims 1 to 38, wherein the ECO further includes a
description of or a reference to the protein structure received in step (A).

40. A method according to any of claims 1 to 39, wherein the ECO further includes a
description of or a reference to the modeled residue positions defined in step (A).

41. A method according to any of claims 1 to 40, wherein the ECO further includes a
description of or a reference to the amino acid reference sequence defined in step (A).

42. A method according to any of claims 1 to 41, wherein the ECO further includes a
description of or a reference to the structure optimization method used in step (C).

43. A method according to any of claims 1 to 42, wherein the ECO further includes a
description of or a reference to the rotamer library used to assign the conformation of
[illegible] of the modeled, non-pattern residues in accordance with step (C).

44. A method according to any of claims 1 to 43, wherein the ECO further includes a
description of or a reference to the cost function used in step (C).

45. A method according to any of claims 1 to 44, wherein the ECO further includes a
description of or a reference to the substituted and optimized protein structure obtained
in step (C).

46. A method according to any of claims 1 to 45, wherein the ECO further includes
additional information derived from either the reference structure of step (A) or the
substituted and optimized structure of step (C), such as the local secondary structure
of each modeled residue and/or its accessible surface area.
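
By way of example only, an energetic compatibility object (ECO) carrying the items listed in the preceding claims may be represented as a simple record. The field names below (`structure_ref`, `optimizer_ref`, `extra`, and so on) and the helper `make_eco` are hypothetical; the claims require only that the corresponding descriptions or references be included, not any particular layout. The stored energy follows the difference form (substituted and optimized structure minus reference structure), with a reference energy of zero corresponding to the absolute-energy variant.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass(frozen=True)
class PatternElement:
    position: int        # residue position within the reference structure
    residue_type: str    # amino acid residue type placed at that position
    conformation: str    # e.g. a rotamer identifier (assumed encoding)

@dataclass
class ECO:
    pattern: Tuple[PatternElement, ...]
    energy: float                            # energetic compatibility value
    structure_ref: Optional[str] = None      # reference structure
    modeled_positions: Tuple[int, ...] = ()  # modeled residue positions
    reference_sequence: Optional[str] = None
    optimizer_ref: Optional[str] = None      # structure optimization method
    rotamer_library_ref: Optional[str] = None
    cost_function_ref: Optional[str] = None
    optimized_structure_ref: Optional[str] = None
    extra: Dict[str, object] = field(default_factory=dict)  # e.g. secondary structure, ASA

def make_eco(pattern, e_substituted, e_reference=0.0, **refs):
    """Build an ECO; e_reference=0.0 yields the absolute global energy."""
    return ECO(pattern=tuple(pattern), energy=e_substituted - e_reference, **refs)
```

For instance, `make_eco([PatternElement(12, "LEU", "r3")], -105.0, -100.0, structure_ref="ref")` stores a compatibility value of -5.0 for a single residue pattern.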

47. A method according to any of claims 1 to 46, wherein steps (B) to (D) are repeated
for different patterns defined in step (B).



48. A method according to claim 47, wherein the patterns are single residue patterns
located at one single residue position selected from the set of modeled residue positions
of step (A) and are formed by systematically considering all possible combinations of
amino acid residue types in different conformations, and wherein the amino acid residue
types are selected from a pre-defined list of residue types and the said conformations are
selected from a pre-defined list of conformations.

49. A method according to claim 47, wherein the patterns are double residue patterns
located at two different residue positions selected from the set of modeled residue
positions of step (A) and are formed by systematically considering, for the pair of
selected residue positions, all possible combinations of amino acid residue types in
different conformations, and wherein the amino acid residue types are selected from a
pre-defined list of residue types and the said conformations are selected from a
pre-defined list of conformations.



50. A method according to claim 48, being systematically repeated until each single 
residue position of the set of modeled residue positions of step (A) has been selected as 
a pattern residue position. 

51. A method according to claim 49, being systematically repeated until each possible
pair of residue positions of the set of modeled residue positions of step (A) has been
selected as a pair of pattern residue positions.

52. A method according to claim 51, wherein the number of considered double residue
patterns is restricted in accordance with a method taking into account a measure of the
distance between two residue positions in the reference structure of step (A).

53. A method according to claim 52, wherein the number of considered double
residue patterns is restricted in accordance with a method further taking into account a
measure of the size of the amino acid residue types placed at each of both pattern
residue positions.
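
The systematic generation of double residue patterns with a distance-based restriction may, purely as an illustration, be sketched as follows. The sketch assumes that each residue position is represented by one 3D coordinate (for example a C-alpha atom), that the distance measure is a simple Euclidean cut-off `max_dist`, and that the pre-defined conformations are given per residue type; these choices, and the name `double_patterns`, are assumptions. The additional size-dependent restriction is not shown.

```python
import math
from itertools import combinations, product

def double_patterns(positions, coords, conformations, max_dist=8.0):
    """Yield ((pos_i, type_i, conf_i), (pos_j, type_j, conf_j)) for nearby position pairs.

    conformations maps each residue type to its pre-defined list of conformations.
    """
    options = [(t, c) for t, confs in conformations.items() for c in confs]
    for i, j in combinations(positions, 2):
        # Skip pairs that are too far apart in the reference structure.
        if math.dist(coords[i], coords[j]) > max_dist:
            continue
        for (ti, ci), (tj, cj) in product(options, repeat=2):
            yield ((i, ti, ci), (j, tj, cj))
```

With three type/conformation options per position and a single position pair within the cut-off, the generator yields nine double residue patterns.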

54. A method according to any of claims 1 to 53, wherein the quantitative measure
representing the energetic compatibility of a pattern within the context of a substituted
and adapted reference structure is further transformed.

55. A method according to claim 54, wherein the said transformation is effected by
means of a linear, logarithmic or exponential function.

56. A method according to any of claims 47 to 55, wherein the ECO's are further
grouped into GECO's and wherein this grouping operation occurs on the basis of the
quantitative measure representing the energetic compatibility of different patterns being
located at the same residue position(s) and assuming the same residue type(s).
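
One possible way of grouping ECO's that share the same residue position(s) and residue type(s) is sketched below. The claim does not state how the grouped object is derived from the energetic compatibility values of its members; representing each group by its lowest-energy member is an assumption made here for illustration, as are the tuple encoding of a pattern and the name `group_ecos`.

```python
from collections import defaultdict

def group_ecos(ecos):
    """Group (pattern, energy) pairs by position(s) and residue type(s).

    Each pattern is a tuple of (position, residue_type, conformation) elements.
    Returns one representative per group: the member with the lowest energy
    (an assumed selection rule).
    """
    groups = defaultdict(list)
    for pattern, e in ecos:
        # The grouping key ignores the conformation of each pattern element.
        key = tuple((pos, rtype) for pos, rtype, _conf in pattern)
        groups[key].append((pattern, e))
    return {key: min(members, key=lambda m: m[1]) for key, members in groups.items()}
```

Two ECO's for leucine at position 5 in different rotamers thus collapse into a single grouped entry, while valine at the same position forms its own group.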

57. A fold recognition method to identify a potential structural relationship between a
particular target amino acid sequence and one or more protein 3D structures, the said
protein 3D structures being analyzed by a method according to any of claims 1 to 56.

58. An inverse folding method to identify a potential structural relationship between a
particular protein 3D structure and one or more known amino acid sequences, wherein
the said protein 3D structure is analyzed by a method according to any of claims 1 to 56.

59. A protein design method to identify or generate amino acid sequences which are
energetically compatible with a particular protein 3D structure, wherein the said protein
3D structure is analyzed by a method according to any of claims 1 to 56.

60. A method according to any of claims 1 to 56, wherein the cost function includes an
energetic contribution accounting for main-chain flexibility.

61. A method according to any of claims 1 to 56, wherein the energy function includes
an energetic contribution accounting for solvation effects.

62. A method according to claim 61, wherein the energetic contribution for solvation
effects is calculated by a method including the assignment of a set of energetic
solvation terms to each residue type depending on the degree of solvent exposure of
its respective rotamers at the considered residue positions in the protein structure.

63. A method according to claim 61, wherein the energetic contribution for solvation
effects is a type-dependent, topology-specific solvation method including
establishing a set of energetic parameters for each of different classes of solvent
exposure.



64. A method according to claim 63, wherein the classes of solvent exposure correspond 
with buried, semi-buried and solvent-exposed rotamers, respectively. 



65. A method according to claim 63 or claim 64, wherein:

- first, each residue type at a given residue position in a specific rotameric state is
substituted into the protein structure and its accessible surface area (ASA) is
calculated,

- next, a class assignment occurs on the basis of the percentage ASA of the residue
side chain in the protein structure compared to the maximal ASA of the same side
chain being shielded from the solvent only by its own main-chain atoms, and

- finally an appropriate type- and topology-specific energy, E_TTS(T,C), for the
considered pattern element is retrieved from a table of TTS values, using the pattern
type (T) and class (C) indices.
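
The class assignment and table lookup of the preceding claim may be illustrated by the following sketch. The three class names follow the buried, semi-buried and solvent-exposed classes recited above; the percentage thresholds (`buried_below`, `exposed_above`), the table layout keyed by (type, class), and the function names are hypothetical, since the claims do not fix numerical boundaries.

```python
def exposure_class(asa, max_asa, buried_below=0.2, exposed_above=0.5):
    """Assign a solvent exposure class from the fractional side-chain ASA.

    asa:     side-chain ASA of the rotamer within the protein structure
    max_asa: ASA of the same side chain shielded only by its own main chain
    Thresholds are illustrative assumptions.
    """
    frac = asa / max_asa
    if frac < buried_below:
        return "buried"
    if frac > exposed_above:
        return "solvent-exposed"
    return "semi-buried"

def tts_energy(table, residue_type, asa, max_asa):
    """Retrieve E_TTS(T, C) from a table keyed by (residue type, class)."""
    return table[(residue_type, exposure_class(asa, max_asa))]
```

A leucine rotamer with 10 % relative exposure would, under these assumed thresholds, be classified as buried and receive the table entry stored under `("LEU", "buried")`.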

66. A method according to claim 65, wherein the table of TTS values is established by:

- first considering a set of a number of well-resolved protein structures and assigning
to each residue position of this set a unique index i,

- then considering, at each residue position i, all residue types a in all rotameric states
r, thereby forming the set of all possible position-type-rotamer combinations i_r^a,

- defining by i_r^w the WT residue type at position i in a rotameric state r as observed in
the protein comprising position i,

- assigning to each i_r^a a solvent exposure class C(i_r^a),

- calculating for each i_r^a the value E_pot(i_r^a) as the total potential energy the rotamer
i_r^a experiences within the fixed context of the protein structure comprising i,

- defining a function ΔE(i) according to equation (2)

\[
\Delta E(i) = \min_{r}\left( E_{\mathrm{pot}}(i_r^w) + E_{\mathrm{TTS}}(w, C(i_r^w)) \right)
 - \min_{a}\min_{r}\left( E_{\mathrm{pot}}(i_r^a) + E_{\mathrm{TTS}}(a, C(i_r^a)) \right)
 \qquad \text{(eq. 2)}
\]

ΔE(i) being the difference in energy of the WT residue type in its best possible
rotamer and the energetically most favorable residue type at residue position i,
also in its best possible rotamer, given a set of TTS parameters, and

- applying a suitable parameter optimization method to maximize S by varying
E_TTS(T,C) parameters.
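
For illustration, the quantity defined by equation (2) may be evaluated for one residue position as sketched below: the best total energy (potential energy plus TTS term) of the wild-type residue type over its rotamers, minus the best total energy over all residue types and rotamers. The dictionary layout and the name `delta_e` are assumptions; the score S and the parameter optimization itself are not shown.

```python
def delta_e(e_pot, e_tts, exposure, wt_type):
    """Evaluate equation (2) at a single residue position.

    e_pot[(a, r)]    -> potential energy of type a in rotamer r at this position
    exposure[(a, r)] -> solvent exposure class of that rotamer
    e_tts[(a, c)]    -> TTS parameter for type a in class c
    """
    best_by_type = {}
    for (a, r), e in e_pot.items():
        total = e + e_tts[(a, exposure[(a, r)])]
        # Keep the best (lowest) total energy over rotamers for each type.
        if a not in best_by_type or total < best_by_type[a]:
            best_by_type[a] = total
    # WT type in its best rotamer versus the most favorable type overall.
    return best_by_type[wt_type] - min(best_by_type.values())
```

The result is zero when the wild-type residue type is itself the most favorable type at the position, and positive otherwise.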

67. A type-dependent, topology-specific solvation method for the assignment of a set of
energetic solvation terms to a set of residue types, depending on the degree of solvent
exposure of their respective rotamers at the considered residue positions in a protein
structure, wherein:

- first, each residue type at a given residue position in a specific rotameric state is
substituted into the protein structure and its accessible surface area (ASA) is
calculated,

- next, a class assignment occurs on the basis of the percentage ASA of the residue
side chain in the protein structure compared to the maximal ASA of the same side
chain being shielded from the solvent only by its own main-chain atoms, and

- finally an appropriate type- and topology-specific energy, E_TTS(T,C), for the
considered pattern element is retrieved from a table of TTS values, using the pattern
type (T) and class (C) indices.

68. A method according to claim 67, wherein the table of TTS values is established by:

- first considering a set of a number of well-resolved protein structures and assigning
to each residue position of this set a unique index i,

- then considering, at each residue position i, all residue types a in all rotameric states
r, thereby forming the set of all possible position-type-rotamer combinations i_r^a,

- defining by i_r^w the WT residue type at position i in a rotameric state r as observed in
the protein comprising position i,

- assigning to each i_r^a a solvent exposure class C(i_r^a),

- calculating for each i_r^a the value E_pot(i_r^a) as the total potential energy the rotamer
i_r^a experiences within the fixed context of the protein structure comprising i,

- defining a function ΔE(i) according to equation (2)

\[
\Delta E(i) = \min_{r}\left( E_{\mathrm{pot}}(i_r^w) + E_{\mathrm{TTS}}(w, C(i_r^w)) \right)
 - \min_{a}\min_{r}\left( E_{\mathrm{pot}}(i_r^a) + E_{\mathrm{TTS}}(a, C(i_r^a)) \right)
 \qquad \text{(eq. 2)}
\]

ΔE(i) being the difference in energy of the WT residue type in its best possible
rotamer and the energetically most favorable residue type at residue position i,
also in its best possible rotamer, given a set of TTS parameters, and

- applying a suitable parameter optimization method to maximize S by varying
E_TTS(T,C) parameters.

69. A nucleic acid sequence encoding a protein sequence analyzed by a method 
according to any of claims 1 to 56. 

70. An expression vector comprising the nucleic acid sequence of claim 69.

71. A host cell comprising the nucleic acid sequence of claim 69.

72. A pharmaceutical composition comprising a therapeutically effective amount of a 
protein sequence analyzed by a method according to any of claims 1 to 56 and a 
pharmaceutically acceptable carrier. 

73. A method of treating a disease in a mammal, comprising administering to said
mammal a pharmaceutical composition according to claim 72.

74. An ECO obtainable by a method according to any of claims 1 to 56. 

75. A database in the form of a data structure comprising a set of ECO's obtainable by 
a method according to any of claims 1 to 56. 

76. A computing device for analyzing a protein structure, comprising:

means for receiving a reference structure for a protein, whereby

- the said reference structure forms a representation of a 3D structure of the said
protein, and

- the protein consists of a plurality of residue positions, each carrying a particular
reference amino acid type in a specific reference conformation, and

- the protein residues are classified into a set of modeled residue positions and a set
of fixed residues, the latter being included into a fixed template;

means for substituting into the reference structure of step (A) a pattern, whereby

- the said pattern consists of one or more of the modeled residue positions defined
in step (A), each carrying a particular amino acid residue type in a fixed
conformation, and

- the one or more amino acid residue types of the said pattern are replacing the
corresponding amino acid residue types present in the reference structure;

means for optimizing the global conformation of the reference structure of step (A)
being substituted by the pattern of step (B), whereby

- a suitable protein structure optimization method based on a function allowing to
assess the quality of a global protein structure, or any part thereof, is used in
combination with a suitable conformational search method, and

- the said structure optimization method is applied to all modeled residue positions
defined in step (A) not being located at any of the pattern residue positions
defined in step (B), and

- the pattern and template residues are kept fixed;

means for assessing the energetic compatibility of the pattern defined in step (B) within
the context of the reference structure defined in step (A) being structurally optimized in
step (C) with respect to the said pattern, by way of comparing the global energy of the
substituted and optimized protein structure with the global energy of the non-substituted
reference structure; and

memory for storing a value reflecting the said energetic compatibility of the pattern
together with information related to the structure of the pattern in the form of an
energetic compatibility object as a data structure.

77. A computing device for computing a type-dependent, topology-specific solvation
for the assignment of a set of energetic solvation terms to a set of residue types,
depending on the degree of solvent exposure of their respective rotamers at the
considered residue positions in a protein structure, comprising:

- means for substituting each residue type at a given residue position in a specific
rotameric state into the protein structure and calculating its accessible surface area
(ASA),

- means for assigning a class on the basis of the percentage ASA of the residue side
chain in the protein structure compared to the maximal ASA of the same side chain
being shielded from the solvent only by its own main-chain atoms, and

- means for retrieving an appropriate type- and topology-specific energy, E_TTS(T,C),
for the considered pattern element from a table of TTS values in the form of a data
structure, using the pattern type (T) and class (C) indices.

78. A computer program product to be utilized for computing on a computing system
with a processor and memory, comprising:

instruction means for receiving a reference structure for a protein, whereby

- the said reference structure forms a representation of a 3D structure of the said
protein, and

- the protein consists of a plurality of residue positions, each carrying a particular
reference amino acid type in a specific reference conformation, and

- the protein residues are classified into a set of modeled residue positions and a set
of fixed residues, the latter being included into a fixed template;

instruction means for substituting into the reference structure of step (A) a pattern,
whereby

- the said pattern consists of one or more of the modeled residue positions defined
in step (A), each carrying a particular amino acid residue type in a fixed
conformation, and

- the one or more amino acid residue types of the said pattern are replacing the
corresponding amino acid residue types present in the reference structure;

instruction means for optimizing the global conformation of the reference structure
of step (A) being substituted by the pattern of step (B), whereby

- a suitable protein structure optimization method based on a function allowing to
assess the quality of a global protein structure, or any part thereof, is used in
combination with a suitable conformational search method, and

- the said structure optimization method is applied to all modeled residue
positions defined in step (A) not being located at any of the pattern residue
positions defined in step (B), and

- the pattern and template residues are kept fixed;

instruction means for assessing the energetic compatibility of the pattern defined in step
(B) within the context of the reference structure defined in step (A) being structurally
optimized in step (C) with respect to the said pattern, by way of comparing the global
energy of the substituted and optimized protein structure with the global energy of the
non-substituted reference structure; and

instruction means for storing a value reflecting the said energetic compatibility of the
pattern together with information related to the structure of the pattern in the form of an
energetic compatibility object as a data structure in memory.

79. A computer program product to be utilized with a computer system having a
memory and a processor for computing a type-dependent, topology-specific solvation
for the assignment of a set of energetic solvation terms to a set of residue types,
depending on the degree of solvent exposure of their respective rotamers at the
considered residue positions in a protein structure, comprising:

- instruction means for substituting each residue type at a given residue position in a
specific rotameric state into the protein structure and calculating its accessible
surface area (ASA),

- instruction means for assigning a class on the basis of the percentage ASA of the
residue side chain in the protein structure compared to the maximal ASA of the
same side chain being shielded from the solvent only by its own main-chain atoms,
and

- instruction means for retrieving an appropriate type- and topology-specific energy,
E_TTS(T,C), for the considered pattern element from a table of TTS values in the form
of a data structure, using the pattern type (T) and class (C) indices.

80. A computer readable data carrier comprising an executable computer program
product in accordance with claim 78 or 79 or for executing any of the methods of
claims 1 to 66.

81. A method comprising the steps of:

inputting a description of at least a protein reference structure at a near location;

transmitting the description to a remote processing engine running a computer
program for carrying out any of the methods of claims 1 to 66;

receiving at a near location from the remote processing engine an output of a
method in accordance with any of claims 1 to 66.